Botanica is a Python-based automation system for collecting, processing and managing data. It's designed to help cultivate a personal collection of data scrapers, or bots, with a shared ecosystem for scheduling, collection and cloud storage. This is a new project and very much a work in progress.
- Unified system: Manage all your scraper projects in one place without needing separate repositories.
- Automated scheduling: Use GitHub Actions to run tasks on a regular basis, from hourly to daily updates.
- Cloud integration: Easily upload processed data to AWS S3 or other cloud services.
- Modular design: Each scraper is self-contained, allowing for easy addition, removal or modification.
- Secure handling: Sensitive information like API keys and credentials is managed securely through GitHub Secrets.
- Flexible and scalable: Easily extend Botanica with new scrapers, configurations and data workflows.
Botanica’s structure is designed to keep things organized and straightforward. Each scraper lives in its own folder within `bots/`, and the utilities and workflows are kept in their own dedicated directories:
```
botanica/
├── bots/
│   └── example/
│       ├── src/
│       │   └── data/            # Where processed data is stored
│       ├── config.example.json  # Template for bot-specific settings
│       ├── requirements.txt     # Dependencies for the bot
│       └── main.py              # Main script for the bot
├── .github/
│   └── workflows/
│       └── example-scraper.yml  # GitHub Action for automating tasks
├── utils/
│   ├── s3_upload.py             # Utility script for S3 uploads
│   └── logs/                    # Directory for logs or temp files
└── .env.example                 # Example environment variables template
```
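The `utils/s3_upload.py` helper shown above is shared by all bots. Its exact interface lives in the source; as a rough illustration only, a minimal version built on `boto3` might look like this (the `upload_file` name and its parameters are assumptions, not the helper's confirmed API):

```python
# utils/s3_upload.py -- illustrative sketch only; the real helper may differ.
from typing import Optional

import boto3


def upload_file(local_path: str, bucket: str, key: str, profile: Optional[str] = None) -> None:
    """Upload a processed data file to S3, optionally via a named AWS profile."""
    session = boto3.Session(profile_name=profile) if profile else boto3.Session()
    session.client("s3").upload_file(local_path, bucket, key)
```

A bot could then call something like `upload_file("src/data/latest.json", "my-bucket", "bots/example/latest.json")` after processing.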
- Add a new bot: To create a new bot, run the `create_bot.py` script and follow the instructions below for customizing the scripts, configuration files and dependencies it needs.
- Set up automation: Use the GitHub Actions workflow template to schedule your bot. Define when and how often it should run.
- Process and upload: When a bot runs, it processes data and saves it in the `src/data` directory. From there, the utility scripts can help you upload the results to cloud storage like AWS S3.
- Manage configurations: Use your local environment or `.env` files and GitHub Actions secrets for sensitive information, and `config.json` files for bot-specific settings (see the sketch below). Example templates are included to make setup easier.
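As a rough illustration of that split, a bot might read secrets from the environment (populated locally from `.env`, or from GitHub Actions secrets in CI) and non-sensitive settings from its `config.json`; the exact key names below are assumptions:

```python
import json
import os

# Secrets come from the environment: .env locally, GitHub Secrets in CI.
aws_key = os.environ.get("AWS_ACCESS_KEY_ID")

# Non-sensitive, bot-specific settings live in config.json.
with open("config.json") as f:
    config = json.load(f)

retries = config.get("retry_attempts", 3)  # key name assumed for illustration
```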
Creating and deploying a new bot in Botanica is designed to be quick and easy. Follow the steps below to get your new data scraper up and running.
Before creating a new bot, ensure your environment is set up:
- Python 3.9+ installed
- Required packages installed (`pip install -r requirements.txt`)
- AWS credentials configured if using cloud uploads
- Run `create_bot.py`: From the root of the project, run the following command:

  ```bash
  python create_bot.py
  ```

- Follow the prompts:
  - Bot name: Enter a unique name for your new bot (e.g., `weather_scraper`).
  - AWS profile name: Specify an AWS profile if you want to use a specific one (optional). Otherwise, press Enter to use the default environment credentials.
  - Users/entities: Provide a list of entities or usernames separated by commas. This can be customized later.
What this does:
- Copies the `template_bot` directory and sets up a new bot in `./bots/<your_bot_name>`.
- Updates `config.json` with your specified details.
- Appends a log entry to the `README.md` with your new bot.
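For a rough sense of what that involves, here is a simplified sketch of the scaffolding steps (the paths and config key names are assumptions; the real `create_bot.py` may differ):

```python
import json
import shutil
from pathlib import Path


def scaffold_bot(name: str, aws_profile: str, entities: list[str]) -> None:
    """Rough sketch of the scaffolding steps; the real script may differ."""
    new_dir = Path("bots") / name
    shutil.copytree("bots/template_bot", new_dir)  # copy the template (path assumed)

    # Record the prompted details in the new bot's config (key names assumed).
    config_path = new_dir / "config.json"
    config = {"bot_name": name, "aws_profile": aws_profile, "entities": entities}
    config_path.write_text(json.dumps(config, indent=2))

    # Append a log entry for the new bot to the project README.
    with open("README.md", "a") as readme:
        readme.write(f"\n- {name}\n")
```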
Navigate to the new bot directory:

```bash
cd bots/<your_bot_name>
```
- Edit `main.py`:
  - Replace the placeholder scraping logic with your specific data extraction code. Modify the section marked `CUSTOM SCRAPING LOGIC STARTS HERE` (see the sketch below).
  - Use the existing structure for data processing, storage, and upload, so you don’t need to worry about file handling or cloud integration.
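As a hedged sketch of that layout (the marker comment comes from the template; the surrounding function names and output filename are placeholders, not the template's actual structure):

```python
# main.py -- simplified skeleton; the real template includes more plumbing.
import json
from pathlib import Path

OUTPUT_DIR = Path("src/data")


def scrape() -> list[dict]:
    # --- CUSTOM SCRAPING LOGIC STARTS HERE ---
    # Replace this placeholder with your own data extraction code.
    return [{"example": "record"}]
    # (end of custom logic)


def main() -> None:
    records = scrape()
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    (OUTPUT_DIR / "latest.json").write_text(json.dumps(records, indent=2))


if __name__ == "__main__":
    main()
```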
- Edit `config.json` (optional):
  - Modify any parameters such as `output_directory`, `timeseries_file`, or `retry_attempts` as needed (an example appears below).
  - Update other settings like query parameters, API endpoints, or user lists.
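For reference, a `config.json` using those parameters might look like this; the values, and any keys beyond the three named above, are illustrative:

```json
{
  "output_directory": "src/data",
  "timeseries_file": "timeseries.json",
  "retry_attempts": 3,
  "entities": ["user_one", "user_two"]
}
```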
- Navigate to the bot’s directory:

  ```bash
  cd bots/<your_bot_name>
  ```

- Install any additional dependencies: If your bot requires packages not included in the main project’s `requirements.txt`, add them to the bot’s `requirements.txt` file and install:

  ```bash
  pip install -r requirements.txt
  ```

- Run the bot: Test your bot locally to ensure everything is working as expected:

  ```bash
  python main.py
  ```
- Create a workflow file:
  - Copy the `template_bot.yml` file from `.github/workflows/` or duplicate an existing bot’s `.yml` file.
  - Rename it to `<your_bot_name>.yml`.
- Customize the workflow: Update the new `.yml` file:
  - Name: Change the name to match your bot, e.g., `name: weather_scraper`.
  - Paths and environment variables: Ensure that `BOT_NAME`, `BOT_PATH`, and other environment variables reflect your new bot.
  - Schedule: Adjust the cron schedule to determine when your bot should run (e.g., daily, hourly); examples follow below.
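GitHub Actions evaluates cron schedules in UTC. A couple of illustrative schedules:

```yaml
schedule:
  - cron: '0 * * * *'  # hourly, at the top of the hour
  - cron: '0 6 * * *'  # daily at 6 AM UTC
```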
- Push changes to GitHub: After setting up the workflow, commit and push your changes:

  ```bash
  git add .
  git commit -m "Add new bot: <your_bot_name>"
  git push origin main
  ```
- Check GitHub Actions:
  - Navigate to the Actions tab in your GitHub repository.
  - Confirm that your new bot’s workflow is listed and has run successfully. If there are any errors, check the logs for troubleshooting.
- Check the output:
  - Verify that the processed data is correctly stored in the output directory you specified.
  - Ensure the data has been uploaded to S3 (or your cloud storage), and confirm accessibility if needed.
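If your data goes to S3, a quick spot-check with `boto3` can confirm the newest objects landed; the bucket name and prefix below are placeholders:

```python
import boto3

# List the five most recently modified objects under the bot's prefix.
s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="your-bucket", Prefix="bots/example/")
recent = sorted(resp.get("Contents", []), key=lambda o: o["LastModified"], reverse=True)
for obj in recent[:5]:
    print(obj["Key"], obj["LastModified"])
```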
For reference, the complete example workflow looks like this:

```yaml
name: example

on:
  workflow_dispatch:
  schedule:
    - cron: '0 6 * * *' # Runs daily at 6 AM UTC

jobs:
  build:
    runs-on: ubuntu-latest
    env:
      BOT_NAME: "example"
      BOT_PATH: "./bots/example"
      PYTHONPATH: "${{ github.workspace }}"
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'

      - name: Install Python dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r ${{ env.BOT_PATH }}/requirements.txt

      - name: Run the scraper
        run: |
          python ${{ env.BOT_PATH }}/main.py
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          AWS_DEFAULT_REGION: 'us-east-1'

      - name: Pull latest changes
        run: |
          git config pull.rebase false
          git pull origin main

      - name: Commit updated data
        run: |
          git config --local user.email "action@github.com"
          git config --local user.name "GitHub Action"
          git add ${{ env.BOT_PATH }}/src/data/
          git commit -m "Updated data" -a --allow-empty --author="stiles <stiles@users.noreply.github.com>"

      - name: Push changes to main branch
        run: |
          git push origin main
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```
Planned improvements:
- Automated cleanup: Add a feature to automatically archive or delete old files from S3 (one possible approach is sketched below).
- Comprehensive error handling: Improve the bots’ code to handle more specific errors and retry failed requests.
- Bot monitoring: Add a monitoring system that alerts when a bot encounters repeated failures.
- Automated inventory collection: Keep an updated list of bots, metadata and output paths.
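For the automated cleanup item, one possible approach (not yet implemented; the bucket, prefix, and retention window below are placeholders) is to delete objects older than a cutoff:

```python
from datetime import datetime, timedelta, timezone

import boto3

# Sketch for the planned cleanup feature: remove objects older than 90 days.
s3 = boto3.client("s3")
cutoff = datetime.now(timezone.utc) - timedelta(days=90)
resp = s3.list_objects_v2(Bucket="your-bucket", Prefix="bots/")
for obj in resp.get("Contents", []):
    if obj["LastModified"] < cutoff:
        s3.delete_object(Bucket="your-bucket", Key=obj["Key"])
```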