AI Scraper

AI Scraper combines web scraping with AI-powered content parsing. It provides a Streamlit interface, asynchronous web crawling, and content extraction driven by an LLM served through Ollama.

Features

- Simple and Advanced scraping modes
- Asynchronous web crawling with aiohttp
- AI-powered content extraction using an Ollama LLM
- CAPTCHA solving integration with Scraping Browser
- Configurable extraction strategies
- User-friendly interface with animated components
- Content chunking for efficient processing

Prerequisites

- Python 3.7+
- Ollama installed and running with the llama3.1 model
- Access to a Scraping Browser remote WebDriver
- Required Python packages (see Installation)

Installation

Clone the repository:

git clone https://github.com/newtglobalgit/ai-scraper.git
cd ai-scraper

Create a virtual environment:

python -m venv ai
source ai/bin/activate  # On Windows, use `ai\Scripts\activate`

Install the required packages:
```
pip install -r requirements.txt
```
Set up environment variables: Create a .env file in the project root and add:
```
SBR_WEBDRIVER=<Your Scraping Browser WebDriver URL>
```
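As a minimal sketch of how the setting might be validated at startup: the helper name `get_webdriver_url` is hypothetical and not taken from the project, and the code assumes the variable is already present in the process environment (for example, exported in the shell or loaded from `.env` by a dotenv loader).

```python
import os


def get_webdriver_url(env=None):
    """Return the Scraping Browser WebDriver URL or raise a clear error.

    Hypothetical helper for illustration; the project may do this differently.
    """
    # Allow a plain dict to be passed in so the function is easy to test.
    env = os.environ if env is None else env
    url = env.get("SBR_WEBDRIVER", "").strip()
    if not url:
        raise RuntimeError(
            "SBR_WEBDRIVER is not set; add it to .env or export it in the shell"
        )
    return url
```

Failing early with an explicit message tends to be easier to diagnose than a connection error raised later by the WebDriver client.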

Usage

Ensure Ollama is running with the llama3.1 model loaded.
Start the Streamlit app:
```
streamlit run main.py
```
Using the application:
- Choose between Simple and Advanced scraping modes
- Enter a website URL to scrape
- In Advanced mode, provide specific extraction instructions
- View the scraped content in the expandable section
- Provide parsing instructions for AI-powered content extraction
- Review the parsed results

Project Structure

- main.py: Streamlit application with UI components and main workflow
- scrape.py: Handles web scraping, async crawling, and content processing
  - AsyncWebCrawler: Asynchronous web crawling implementation
  - ScrapingResult: Data class for crawling results
  - Content processing utilities (cleaning, chunking)
- extraction_strategy.py: Defines extraction strategy interface and LLM implementation
  - ExtractionStrategy: Abstract base class for extraction strategies
  - LLMExtractionStrategy: Ollama-based extraction implementation
- parse.py: Manages AI-powered content parsing using Ollama LLM

Key Components

Extraction Strategy

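Extraction logic follows the strategy pattern: every strategy implements a common asynchronous interface, so strategies can be swapped without changing the crawler.

from abc import ABC, abstractmethod

class ExtractionStrategy(ABC):
    @abstractmethod
    async def extract(self, content):
        # Each concrete strategy implements its own extraction logic here.
        pass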
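To illustrate how a concrete strategy could plug into this interface, here is a small sketch. `KeywordExtractionStrategy` and `run_strategy` are hypothetical names invented for this example and are not part of the project. The interface is repeated so the snippet runs on its own.

```python
import asyncio
from abc import ABC, abstractmethod


class ExtractionStrategy(ABC):
    @abstractmethod
    async def extract(self, content):
        ...


class KeywordExtractionStrategy(ExtractionStrategy):
    """Hypothetical strategy: keep only lines that mention a keyword."""

    def __init__(self, keyword):
        self.keyword = keyword.lower()

    async def extract(self, content):
        # No I/O happens here, but the method stays async so that callers
        # can await any strategy in the same way.
        return [
            line.strip()
            for line in content.splitlines()
            if self.keyword in line.lower()
        ]


async def run_strategy(strategy, content):
    # The caller depends only on the ExtractionStrategy interface.
    return await strategy.extract(content)
```

Usage: `asyncio.run(run_strategy(KeywordExtractionStrategy("price"), page_text))`, where `page_text` is any string of scraped content.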

Async Web Crawler

The crawler is used as an asynchronous context manager, which opens and releases its resources around the crawl:

async with AsyncWebCrawler(verbose=True) as crawler:
    result = await crawler.arun(url, extraction_strategy)
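The context-manager pattern behind that usage can be sketched with the standard library alone. `DemoCrawler`, `DemoResult`, and `CharCount` are hypothetical stand-ins for illustration; they perform no network I/O and do not reproduce the project's actual `AsyncWebCrawler` or `ScrapingResult`.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class DemoResult:
    """Hypothetical stand-in for a crawl result."""
    url: str
    html: str
    extracted: object = None


class DemoCrawler:
    """Hypothetical stand-in for a crawler; fetches nothing from the network."""

    def __init__(self, verbose=False):
        self.verbose = verbose
        self.is_open = False

    async def __aenter__(self):
        # A real crawler would open its HTTP session here.
        self.is_open = True
        return self

    async def __aexit__(self, exc_type, exc, tb):
        # Runs even if the body raised, so resources are always released.
        self.is_open = False
        return False  # do not swallow exceptions

    async def arun(self, url, extraction_strategy=None):
        if not self.is_open:
            raise RuntimeError("crawler used outside 'async with'")
        await asyncio.sleep(0)  # placeholder for the real page fetch
        html = f"<html><body>content of {url}</body></html>"
        extracted = None
        if extraction_strategy is not None:
            extracted = await extraction_strategy.extract(html)
        return DemoResult(url=url, html=html, extracted=extracted)


class CharCount:
    """Toy strategy: reports the length of the fetched content."""

    async def extract(self, content):
        return len(content)


async def main():
    async with DemoCrawler(verbose=True) as crawler:
        result = await crawler.arun("https://example.com", CharCount())
    return crawler, result
```

Running `asyncio.run(main())` returns the crawler (already closed by `__aexit__`) together with the result object.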

Content Processing

- DOM content extraction and cleaning
- Content chunking for efficient processing
- CAPTCHA solving integration
- Custom parsing with the LLM
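The cleaning and chunking steps can be sketched as follows. This is an illustration using only the standard library; the function names `clean_html` and `split_into_chunks` and the default chunk size of 6000 characters are assumptions, not the project's actual implementation.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def clean_html(html):
    """Return the visible text of an HTML document, one fragment per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def split_into_chunks(text, max_length=6000):
    """Split text into consecutive pieces of at most max_length characters."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [text[i:i + max_length] for i in range(0, len(text), max_length)]
```

Chunking by a fixed character count keeps each piece within a predictable size for the LLM, at the cost of occasionally splitting a sentence across two chunks.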

Troubleshooting

- Verify that the SBR_WEBDRIVER environment variable is set correctly
- Ensure Ollama is running and the llama3.1 model is available
- Check console output for error messages and debugging information
- For CAPTCHA-related issues, verify the Scraping Browser configuration
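The Ollama check can be automated with a short script. This sketch assumes Ollama is listening on its default local address, `http://localhost:11434`, and uses its `/api/tags` endpoint, which lists locally installed models. The function names are hypothetical.

```python
import json
import urllib.request


def model_available(tags_payload, model="llama3.1"):
    """Return True if `model` appears in an Ollama /api/tags response."""
    names = [m.get("name", "") for m in tags_payload.get("models", [])]
    # Installed names carry a tag suffix such as "llama3.1:latest".
    return any(n == model or n.startswith(model + ":") for n in names)


def check_ollama(base_url="http://localhost:11434", model="llama3.1"):
    """Query the Ollama server; return False if it cannot be reached."""
    try:
        with urllib.request.urlopen(base_url + "/api/tags", timeout=5) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # Connection failures and malformed JSON both count as "not ready".
        return False
    return model_available(payload, model)
```

If `check_ollama()` returns `False`, start the Ollama server or pull the model (`ollama pull llama3.1`) before launching the app.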

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.
