Enhance link fetching by excluding CSS/JS files and implement extensi… #1
This commit introduces two improvements to the link fetching and scraping mechanism:

Exclusion of CSS and JS Files: The fetch_links_from_url function now filters out links that lead to CSS or JS files. It does this with a regex pattern that matches URLs ending in .css or .js, allowing for a trailing query string or fragment. This keeps the scraper focused on HTML content and avoids following links to resources that are irrelevant to the scrape.

Extension-Based Filtering in Scraping: The scraping logic now filters files by extension. The extensions are specified via a command-line argument, so the scraper can target specific file types and exclude others according to user-defined criteria.
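
As a rough illustration of the CSS/JS exclusion described above, the sketch below filters a list of links with a regex. The pattern, the name ASSET_PATTERN, and the helper filter_asset_links are assumptions made for this example; the actual pattern inside fetch_links_from_url is not shown in this description and may differ.

```python
import re

# Hypothetical pattern (not taken from the PR): matches a URL that ends in
# .css or .js, optionally followed by a query string or fragment.
ASSET_PATTERN = re.compile(r"\.(?:css|js)(?:[?#].*)?$", re.IGNORECASE)


def filter_asset_links(links):
    """Return only the links that do not point at CSS or JS files."""
    return [link for link in links if not ASSET_PATTERN.search(link)]


links = [
    "https://example.com/index.html",
    "https://example.com/static/site.css",
    "https://example.com/static/app.js?v=3",
    "https://example.com/static/app.js#main",
    "https://example.com/data.json",
]

kept = filter_asset_links(links)
print(kept)
```

Because the pattern is applied to the whole URL string, a query value that itself ends in .css or .js would also be excluded; parsing the URL and testing only its path would avoid that, at the cost of a little more code.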
The implementation details include:
fetch_links_from_url now excludes links that match .css or .js files.

These improvements make the scraping process more targeted by skipping unnecessary requests and focusing on the content most relevant to the user.
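
The extension-based filtering mentioned in this description could be wired to a command-line argument roughly as follows. This is a minimal sketch under stated assumptions: the --extensions flag name and the helpers parse_extensions and has_allowed_extension are hypothetical and are not taken from the PR.

```python
import argparse
from pathlib import PurePosixPath
from urllib.parse import urlparse


def parse_extensions(raw):
    """Normalize a comma-separated string such as 'pdf,.TXT' to {'.pdf', '.txt'}."""
    return {
        "." + part.strip().lower().lstrip(".")
        for part in raw.split(",")
        if part.strip().strip(".")
    }


def has_allowed_extension(url, allowed):
    """Compare the extension of the URL path only, ignoring query and fragment."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return suffix in allowed


def build_parser():
    parser = argparse.ArgumentParser(description="Filter scraped URLs by extension")
    # Hypothetical flag name; the real argument in the PR is not shown.
    parser.add_argument(
        "--extensions",
        default="",
        help="comma-separated list of extensions to keep, e.g. 'pdf,txt'",
    )
    return parser


# An explicit argument list is passed here so the example runs without a shell.
args = build_parser().parse_args(["--extensions", "pdf,.TXT"])
allowed = parse_extensions(args.extensions)

urls = [
    "https://example.com/files/report.pdf?download=1",
    "https://example.com/notes.TXT#top",
    "https://example.com/index.html",
    "https://example.com/about",
]

# With no extensions given, every URL is kept.
selected = [u for u in urls if has_allowed_extension(u, allowed)] if allowed else urls
print(selected)
```

Normalizing the extensions once, at argument-parsing time, keeps the per-URL check down to a single set lookup.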