AI Scraper is a powerful tool that combines web scraping capabilities with AI-powered content parsing. It features a Streamlit interface, async web crawling, and intelligent content extraction powered by Ollama LLM.
- Simple and Advanced scraping modes
- Asynchronous web crawling with aiohttp
- AI-powered content extraction using Ollama LLM
- Captcha solving integration with Scraping Browser
- Configurable extraction strategies
- User-friendly interface with animated components
- Content chunking for efficient processing
- Python 3.7+
- Ollama installed and running with the `llama3.1` model
- Access to a Scraping Browser remote WebDriver
- Required Python packages (see Installation)
- Clone the repository:
  ```bash
  git clone https://github.com/newtglobalgit/ai-scraper.git
  cd ai-scraper
  ```
- Create a virtual environment:
  ```bash
  python -m venv ai
  source ai/bin/activate  # On Windows, use ai\Scripts\activate
  ```
- Install the required packages:
  ```bash
  pip install -r requirements.txt
  ```
- Set up environment variables: create a `.env` file in the project root and add the following line (a sketch of reading this value in code follows these steps):
  ```
  SBR_WEBDRIVER=<Your Scraping Browser WebDriver URL>
  ```
- Ensure Ollama is running with the `llama3.1` model loaded.
- Start the Streamlit app:
  ```bash
  streamlit run main.py
  ```
- Using the application:
  - Choose between Simple and Advanced scraping modes
  - Enter a website URL to scrape
  - In Advanced mode, provide specific extraction instructions
  - View the scraped content in the expandable section
  - Provide parsing instructions for AI-powered content extraction
  - Review the parsed results
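The scraper reads the Scraping Browser endpoint from the `SBR_WEBDRIVER` environment variable at runtime. Here is a minimal sketch of that loading step, assuming the `python-dotenv` package; the project's actual startup code may differ:

```python
# Minimal sketch: load SBR_WEBDRIVER from the .env file created above.
# Assumes python-dotenv is installed; the repository may load it differently.
import os

from dotenv import load_dotenv

load_dotenv()  # copies key=value pairs from .env into the process environment

SBR_WEBDRIVER = os.getenv("SBR_WEBDRIVER")
if not SBR_WEBDRIVER:
    raise RuntimeError("SBR_WEBDRIVER is not set; add it to your .env file")
```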
- `main.py`: Streamlit application with UI components and main workflow
- `scrape.py`: Handles web scraping, async crawling, and content processing
  - `AsyncWebCrawler`: Asynchronous web crawling implementation
  - `ScrapingResult`: Data class for crawling results
  - Content processing utilities (cleaning, chunking)
- `extraction_strategy.py`: Defines the extraction strategy interface and LLM implementation
  - `ExtractionStrategy`: Abstract base class for extraction strategies
  - `LLMExtractionStrategy`: Ollama-based extraction implementation
- `parse.py`: Manages AI-powered content parsing using Ollama LLM
The system uses a flexible extraction strategy pattern:
```python
from abc import ABC, abstractmethod


class ExtractionStrategy(ABC):
    @abstractmethod
    async def extract(self, content):
        pass
```
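A concrete strategy plugs a model call into this interface. Below is a minimal sketch of what an Ollama-backed `LLMExtractionStrategy` could look like, using the `ollama` Python client; the prompt format, constructor arguments, and threading details are illustrative assumptions, not the repository's exact implementation:

```python
# Illustrative sketch only; the real LLMExtractionStrategy may differ.
import asyncio

import ollama

from extraction_strategy import ExtractionStrategy  # the ABC shown above


class LLMExtractionStrategy(ExtractionStrategy):
    def __init__(self, model="llama3.1", instruction=""):
        self.model = model              # Ollama model name (assumed parameter)
        self.instruction = instruction  # user-supplied extraction instructions

    async def extract(self, content):
        # ollama.chat is a blocking call, so run it in a worker thread to
        # keep the event loop (and the async crawler) responsive.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, self._call_model, content)

    def _call_model(self, content):
        response = ollama.chat(
            model=self.model,
            messages=[{
                "role": "user",
                "content": f"{self.instruction}\n\n{content}",
            }],
        )
        return response["message"]["content"]
```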
Asynchronous web crawling with context management:
```python
async with AsyncWebCrawler(verbose=True) as crawler:
    result = await crawler.arun(url, extraction_strategy)
```
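For context, here is one possible shape of such a context-managed crawler built on `aiohttp`. The repository's `AsyncWebCrawler` and `ScrapingResult` (including the Scraping Browser and captcha integration) may look quite different, so treat this purely as a sketch:

```python
# Sketch of an aiohttp-based async crawler with context management.
# Class and field names follow the README, but the bodies are assumptions;
# the Scraping Browser / captcha integration is not shown here.
from dataclasses import dataclass

import aiohttp


@dataclass
class ScrapingResult:
    url: str
    extracted_content: str  # field name is an assumption


class AsyncWebCrawler:
    def __init__(self, verbose=False):
        self.verbose = verbose
        self.session = None

    async def __aenter__(self):
        # Open one shared HTTP session for the crawl; it is closed on exit.
        self.session = aiohttp.ClientSession()
        return self

    async def __aexit__(self, exc_type, exc, tb):
        await self.session.close()

    async def arun(self, url, extraction_strategy):
        if self.verbose:
            print(f"Fetching {url}")
        async with self.session.get(url) as response:
            html = await response.text()
        extracted = await extraction_strategy.extract(html)
        return ScrapingResult(url=url, extracted_content=extracted)
```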
- DOM content extraction and cleaning
- Content chunking for efficient processing
- Captcha solving integration
- Custom parsing with LLM
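For the cleaning and chunking steps listed above, here is a rough sketch of the kind of utilities involved, assuming BeautifulSoup for DOM cleanup; the helper names and chunk size are illustrative, not the project's exact code:

```python
# Illustrative content-processing helpers; names and the chunk size are assumptions.
from bs4 import BeautifulSoup


def clean_dom_content(html):
    """Strip scripts and styles, then reduce the DOM to visible text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    text = soup.get_text(separator="\n")
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())


def split_into_chunks(text, max_chars=6000):
    """Split cleaned text into fixed-size chunks for the LLM to process."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```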
- Verify the `SBR_WEBDRIVER` environment variable is correctly set
- Ensure Ollama is running and the `llama3.1` model is available
- Check console output for error messages and debugging information
- For captcha-related issues, verify Scraping Browser configuration
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.