Enhance link fetching by excluding CSS/JS files and implement extensi… #1

Merged
merged 1 commit into from
Nov 2, 2023

franckferman
Owner

This commit introduces two significant improvements to the link fetching and scraping mechanism:

  1. Exclusion of CSS and JS Files: The fetch_links_from_url function now filters out links to CSS and JS files. A regex pattern matches URLs ending in .css or .js, allowing for a trailing query string or fragment. This keeps the scraper focused on HTML content and avoids following links to resources that are irrelevant to the scrape.

  2. Extension-Based Filtering in Scraping: The scraping logic now filters files by extension. The desired extensions are specified via a command-line argument, so the scraper processes only the file types the user asks for and skips the rest.
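
As an illustration of item 1, the CSS/JS exclusion might look roughly like the sketch below. This is not the code from the commit: the pattern, the `is_css_or_js` helper, and the `filter_links` helper are hypothetical names chosen for the example, and Python is assumed as the project language.

```python
import re

# Hypothetical pattern (not taken from the commit): matches a URL that ends
# in .css or .js, optionally followed by a query string (?...) or a
# fragment (#...). Case-insensitive so that .CSS and .JS also match.
ASSET_PATTERN = re.compile(r"\.(?:css|js)(?:[?#].*)?$", re.IGNORECASE)


def is_css_or_js(url: str) -> bool:
    """Return True if the URL appears to point at a CSS or JS file."""
    return ASSET_PATTERN.search(url) is not None


def filter_links(links):
    """Drop CSS/JS links, keeping everything else in the original order."""
    return [link for link in links if not is_css_or_js(link)]
```

One caveat of a purely regex-based check: a page URL whose query string itself ends in `.js` (for example `?file=a.js`) would also be excluded. Parsing the URL and testing only its path would avoid that, at the cost of a little more code.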

The implementation details include:

  • A regex pattern in fetch_links_from_url to exclude links that match .css or .js files.
  • Modification of the file discovery logic to incorporate a dynamic list of desired file extensions for targeted scraping.
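
The extension-based filtering could be sketched as follows. The `--extensions` flag, its default, and the helper names are assumptions made for this example; the PR description only says that extensions can be specified via a command-line argument, not what the argument is called or how it is parsed.

```python
import argparse
from pathlib import PurePosixPath
from typing import Set
from urllib.parse import urlsplit


def parse_extensions(raw: str) -> Set[str]:
    """Normalize a comma-separated list such as 'pdf, .DOCX' to {'.pdf', '.docx'}."""
    extensions = set()
    for part in raw.split(","):
        cleaned = part.strip().lower().lstrip(".")
        if cleaned:  # skip empty entries such as a trailing comma
            extensions.add("." + cleaned)
    return extensions


def has_wanted_extension(url: str, extensions: Set[str]) -> bool:
    """Check the URL path's suffix, ignoring any query string or fragment."""
    suffix = PurePosixPath(urlsplit(url).path).suffix.lower()
    return suffix in extensions


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical flag name and default, used only for illustration.
    parser = argparse.ArgumentParser(description="Scrape files by extension")
    parser.add_argument(
        "--extensions",
        default="pdf",
        help="comma-separated list of file extensions to keep",
    )
    return parser
```

With this sketch, `build_parser().parse_args(["--extensions", "pdf,docx"])` yields a string that `parse_extensions` turns into a set, and each discovered URL is kept only if `has_wanted_extension` returns True for it.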

Together, these changes make scraping more targeted: the scraper skips resources it does not need and processes only the content the user asked for.

…on-based filtering.

@franckferman franckferman merged commit 0d01cdc into stable Nov 2, 2023