This is a concurrent web crawler implemented in Go. It allows you to crawl websites, extract links, and scrape specific data from the visited pages.
- Crawls web pages concurrently using goroutines (see the sketch after this feature list)
- Extracts links from the visited pages
- Scrapes data such as page title, meta description, meta keywords, headings, paragraphs, image URLs, external links, and table data from the visited pages
- Supports configurable crawling depth
- Handles relative and absolute URLs
- Tracks visited URLs to avoid duplicate crawling
- Provides timing information for the crawling process
- Saves the extracted data in a well-formatted CSV file
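
The concurrency and duplicate-tracking features follow a common Go pattern: one goroutine per page, a mutex-protected set of visited URLs, and a `sync.WaitGroup` to wait for the crawl to finish. The sketch below is illustrative only and uses hypothetical names (`crawl`, `fetchLinks`, `visitedSet`); it is not the project's actual implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// visitedSet tracks URLs that have already been crawled so they are not fetched twice.
type visitedSet struct {
	mu   sync.Mutex
	seen map[string]bool
}

// add marks a URL as visited and reports whether it was new.
func (v *visitedSet) add(url string) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.seen[url] {
		return false
	}
	v.seen[url] = true
	return true
}

// crawl fetches a page, extracts its links, and recursively crawls them in new
// goroutines until the configured depth is exhausted. fetchLinks is a placeholder
// for the real page-fetching and link-extraction logic.
func crawl(url string, depth int, visited *visitedSet, wg *sync.WaitGroup, fetchLinks func(string) []string) {
	defer wg.Done()
	if depth <= 0 || !visited.add(url) {
		return
	}
	fmt.Println("crawling:", url)
	for _, link := range fetchLinks(url) {
		wg.Add(1)
		go crawl(link, depth-1, visited, wg, fetchLinks)
	}
}

func main() {
	visited := &visitedSet{seen: make(map[string]bool)}
	var wg sync.WaitGroup
	// Stubbed link extractor so the sketch runs standalone.
	fetchLinks := func(string) []string { return nil }

	wg.Add(1)
	go crawl("https://example.com", 2, visited, &wg, fetchLinks)
	wg.Wait()
}
```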
- Make sure you have Go installed on your system. You can download and install Go from the official website: https://golang.org
- Clone this repository to your local machine:
  ```
  git clone https://github.com/sieep-coding/web-crawler.git
  ```
- Navigate to the project directory:
  ```
  cd web-crawler
  ```
- Install the required dependencies:
  ```
  go mod download
  ```
- Open a terminal and navigate to the project directory.
- Run the following command to start the web crawler, replacing `<url>` with the URL you want to crawl:
  ```
  go run main.go <url>
  ```
- Wait for the crawling process to complete. The crawler displays progress and timing information in the terminal.
- Once the crawling is finished, the extracted data is saved to a CSV file named `crawl_results.csv` in the project directory.
You can customize the web crawler according to your needs:
- Modify the `processPage` function in `crawler/page.go` to extract additional data from the visited pages using the `goquery` package (see the goquery sketch after this list).
- Extend the `Crawler` struct in `crawler/crawler.go` to include more fields for storing extracted data.
- Customize the CSV file generation in `main.go` to match your desired format (a CSV-writing sketch follows this list).
- Implement rate limiting to avoid overloading the target website (a simple rate-limiting sketch follows this list).
- Add support for handling robots.txt and respecting crawling restrictions.
- Integrate the crawler with a database or file storage to persist the extracted data.
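
If you extend the data extraction, goquery's selector API is the main tool. Below is a minimal, hypothetical sketch of pulling extra data from a page; the real `processPage` function in `crawler/page.go` will have a different signature and should be adapted accordingly.

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

// extractHeadings fetches a page and collects the text of every <h2> element.
// It is an illustrative stand-in for extending processPage, not project code.
func extractHeadings(url string) ([]string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		return nil, err
	}

	var headings []string
	doc.Find("h2").Each(func(_ int, s *goquery.Selection) {
		headings = append(headings, s.Text())
	})
	return headings, nil
}

func main() {
	headings, err := extractHeadings("https://example.com")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(headings)
}
```

The extracted values could then be stored on the `Crawler` struct and included in the CSV output.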
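For custom CSV output, the standard library's `encoding/csv` package is sufficient. The sketch below assumes a simplified, hypothetical result type; the actual fields written by `main.go` may differ.

```go
package main

import (
	"encoding/csv"
	"log"
	"os"
)

// pageResult is a simplified stand-in for whatever the crawler collects per page.
type pageResult struct {
	URL   string
	Title string
}

// writeResults writes crawl results to a CSV file with a header row.
func writeResults(path string, results []pageResult) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	w := csv.NewWriter(f)
	if err := w.Write([]string{"url", "title"}); err != nil {
		return err
	}
	for _, r := range results {
		if err := w.Write([]string{r.URL, r.Title}); err != nil {
			return err
		}
	}
	w.Flush()
	return w.Error()
}

func main() {
	results := []pageResult{{URL: "https://example.com", Title: "Example"}}
	if err := writeResults("crawl_results.csv", results); err != nil {
		log.Fatal(err)
	}
}
```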
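Rate limiting can be as simple as a shared `time.Ticker` that every fetch waits on, which keeps requests to roughly one per interval with no extra dependencies. This is a standalone sketch, not code from this repository.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	urls := []string{
		"https://example.com/a",
		"https://example.com/b",
		"https://example.com/c",
	}

	// Allow at most one request every 500ms, shared by all fetches.
	ticker := time.NewTicker(500 * time.Millisecond)
	defer ticker.Stop()

	for _, u := range urls {
		<-ticker.C // block until the next tick before issuing the request
		resp, err := http.Get(u)
		if err != nil {
			fmt.Println("fetch failed:", err)
			continue
		}
		resp.Body.Close()
		fmt.Println("fetched:", u, resp.Status)
	}
}
```

For concurrent fetches across many goroutines, a token-bucket limiter such as `golang.org/x/time/rate` is a more flexible option than a single ticker.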
This project is licensed under the Unlicense.