This is a concurrent web crawler implemented in Go. It allows you to crawl websites, extract links, and scrape specific data from the visited pages.
- Crawls web pages concurrently using goroutines (see the sketch after this feature list)
- Extracts links from the visited pages
- Scrapes data such as page title, meta description, meta keywords, headings, paragraphs, image URLs, external links, and table data from the visited pages
- Supports configurable crawling depth
- Handles relative and absolute URLs
- Tracks visited URLs to avoid duplicate crawling
- Provides timing information for the crawling process
- Saves the extracted data in a well-formatted CSV file
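
The concurrency and duplicate-tracking features follow a common Go pattern: one goroutine per page, a mutex-protected set of visited URLs, and a `sync.WaitGroup` to wait for the crawl to finish. The sketch below is illustrative only and uses hypothetical names (`crawl`, `fetchLinks`, `visitedSet`); it is not the project's actual implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// visitedSet tracks URLs that have already been crawled so they are not fetched twice.
type visitedSet struct {
	mu   sync.Mutex
	seen map[string]bool
}

// add marks a URL as visited and reports whether it was new.
func (v *visitedSet) add(url string) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	if v.seen[url] {
		return false
	}
	v.seen[url] = true
	return true
}

// crawl fetches a page, extracts its links, and recursively crawls them in new
// goroutines until the configured depth is exhausted. fetchLinks is a placeholder
// for the real page-fetching and link-extraction logic.
func crawl(url string, depth int, visited *visitedSet, wg *sync.WaitGroup, fetchLinks func(string) []string) {
	defer wg.Done()
	if depth <= 0 || !visited.add(url) {
		return
	}
	fmt.Println("crawling:", url)
	for _, link := range fetchLinks(url) {
		wg.Add(1)
		go crawl(link, depth-1, visited, wg, fetchLinks)
	}
}

func main() {
	visited := &visitedSet{seen: make(map[string]bool)}
	var wg sync.WaitGroup
	// Stubbed link extractor so the sketch runs standalone.
	fetchLinks := func(string) []string { return nil }

	wg.Add(1)
	go crawl("https://example.com", 2, visited, &wg, fetchLinks)
	wg.Wait()
}
```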
- Make sure you have Go installed on your system. You can download and install Go from the official website: https://golang.org
- Clone this repository to your local machine:
  ```
  git clone https://github.com/sieep-coding/web-crawler.git
  ```
- Navigate to the project directory:
  ```
  cd web-crawler
  ```
- Install the required dependencies:
  ```
  go mod download
  ```
- Open a terminal and navigate to the project directory.
- Run the following command to start the web crawler, replacing `<url>` with the URL you want to crawl:
  ```
  go run main.go <url>
  ```
- Wait for the crawling process to complete. The crawler displays progress and timing information in the terminal.
- Once the crawling is finished, the extracted data is saved to a CSV file named `crawl_results.csv` in the project directory.
You can customize the web crawler according to your needs:
- Modify the `processPage` function in `crawler/page.go` to extract additional data from the visited pages using the `goquery` package (see the goquery sketch after this list).
- Extend the `Crawler` struct in `crawler/crawler.go` to include more fields for storing extracted data.
- Customize the CSV file generation in `main.go` to match your desired format (a CSV-writing sketch follows this list).
- Implement rate limiting to avoid overloading the target website (a simple rate-limiting sketch follows this list).
- Add support for handling robots.txt and respecting crawling restrictions.
- Integrate the crawler with a database or file storage to persist the extracted data.
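
If you extend the data extraction, goquery's selector API is the main tool. Below is a minimal, hypothetical sketch of pulling extra data from a page; the real `processPage` function in `crawler/page.go` will have a different signature and should be adapted accordingly.

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

// extractHeadings fetches a page and collects the text of every <h2> element.
// It is an illustrative stand-in for extending processPage, not project code.
func extractHeadings(url string) ([]string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		return nil, err
	}

	var headings []string
	doc.Find("h2").Each(func(_ int, s *goquery.Selection) {
		headings = append(headings, s.Text())
	})
	return headings, nil
}

func main() {
	headings, err := extractHeadings("https://example.com")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(headings)
}
```

The extracted values could then be stored on the `Crawler` struct and included in the CSV output.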
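For custom CSV output, the standard library's `encoding/csv` package is sufficient. The sketch below assumes a simplified, hypothetical result type; the actual fields written by `main.go` may differ.

```go
package main

import (
	"encoding/csv"
	"log"
	"os"
)

// pageResult is a simplified stand-in for whatever the crawler collects per page.
type pageResult struct {
	URL   string
	Title string
}

// writeResults writes crawl results to a CSV file with a header row.
func writeResults(path string, results []pageResult) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	w := csv.NewWriter(f)
	if err := w.Write([]string{"url", "title"}); err != nil {
		return err
	}
	for _, r := range results {
		if err := w.Write([]string{r.URL, r.Title}); err != nil {
			return err
		}
	}
	w.Flush()
	return w.Error()
}

func main() {
	results := []pageResult{{URL: "https://example.com", Title: "Example"}}
	if err := writeResults("crawl_results.csv", results); err != nil {
		log.Fatal(err)
	}
}
```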
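Rate limiting can be as simple as a shared `time.Ticker` that every fetch waits on, which keeps requests to roughly one per interval with no extra dependencies. This is a standalone sketch, not code from this repository.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	urls := []string{
		"https://example.com/a",
		"https://example.com/b",
		"https://example.com/c",
	}

	// Allow at most one request every 500ms, shared by all fetches.
	ticker := time.NewTicker(500 * time.Millisecond)
	defer ticker.Stop()

	for _, u := range urls {
		<-ticker.C // block until the next tick before issuing the request
		resp, err := http.Get(u)
		if err != nil {
			fmt.Println("fetch failed:", err)
			continue
		}
		resp.Body.Close()
		fmt.Println("fetched:", u, resp.Status)
	}
}
```

For concurrent fetches across many goroutines, a token-bucket limiter such as `golang.org/x/time/rate` is a more flexible option than a single ticker.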
This project is licensed under the Unlicense.