This project scrapes data from individual company job boards.
Goals
- Learn the components of the Modern Data Stack
- Long term: Create a curated repository of remote jobs to overcome traditional job aggregator challenges, including
  - Duplication of postings
  - Lack of postings
  - Inability to perform fine-grained searching
Supported job boards
- GitLab
Prerequisites
Initial Setup
- Create a local folder to hold the job scraper and Airbyte repos
- Open a terminal window from that folder or `cd` into the folder
- Clone the job scraper repo (run `git clone https://gitlab.com/cameronwtaylor/job-scraper.git` in your terminal)
Create Virtual Environment
- Navigate into the job scraper repo in your terminal
- Run the following commands in your terminal (a quick sanity check follows the list)
  - `python3 -m venv .venv` (creates a virtual environment)
  - `source .venv/bin/activate` (activates the environment)
  - `python3 -m pip install -r requirements.txt` (installs the required dependencies into the active environment)
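Once the environment is active, a quick way to confirm the install worked is to import the core libraries. This is a minimal sketch and assumes `requests`, `beautifulsoup4`, and `dagster` are among the pinned dependencies in `requirements.txt` (which matches the tech stack listed later in this README).

```python
# sanity_check.py - run inside the activated virtual environment.
# Assumption: requests, beautifulsoup4, and dagster are pinned in requirements.txt.
import bs4
import dagster
import requests

# Print resolved versions so a broken install fails loudly here,
# rather than midway through a scraper run.
print("requests", requests.__version__)
print("beautifulsoup4", bs4.__version__)
print("dagster", dagster.__version__)
```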
Install Docker and Airbyte
- Open a terminal window from the local folder you created in Initial Setup or `cd` into the folder
- Follow the official Deploy Airbyte instructions
- Create a folder called `gitlab` inside `/tmp/airbyte_local` (this is a hidden folder in the root directory of your machine; see the sketch after this list)
If this is your first time installing Docker, you may need to open Docker Desktop before running `docker-compose up` to avoid errors.
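The `gitlab` folder can be created by hand with `mkdir -p /tmp/airbyte_local/gitlab`, or programmatically as below. This is a sketch, not repo code; the assumption is that `/tmp/airbyte_local` is the directory Airbyte's docker-compose deployment exposes to connectors as `/local`, and that the scrapers write the departments and jobs files into the `gitlab` subfolder for the Airbyte sources configured later to read.

```python
# Sketch: create the folder the Airbyte local mount expects.
# Assumption: /tmp/airbyte_local is mounted into Airbyte containers as /local,
# and the scrapers write their output files into the gitlab subfolder.
from pathlib import Path

AIRBYTE_LOCAL_ROOT = Path("/tmp/airbyte_local")  # hidden folder at the machine root
GITLAB_DIR = AIRBYTE_LOCAL_ROOT / "gitlab"

GITLAB_DIR.mkdir(parents=True, exist_ok=True)
print(f"Ready: {GITLAB_DIR}")
```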
Install Postgres
- Pull the Postgres Docker image (run `docker pull postgres` in your terminal)
- Run a Postgres instance in a Docker container (run `docker run --name postgres-database -p 127.0.0.1:5432:5432 -e POSTGRES_PASSWORD=password -d postgres` in your terminal); a connectivity check follows the list
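Before wiring Airbyte to this database, it can help to confirm the container is reachable. The check below is a sketch and assumes `psycopg2-binary` is installed (`pip install psycopg2-binary`); it is not necessarily in `requirements.txt`.

```python
# Connectivity check for the dockerized Postgres started above.
# Assumption: psycopg2-binary is installed (pip install psycopg2-binary).
import psycopg2

conn = psycopg2.connect(
    host="127.0.0.1",
    port=5432,
    dbname="postgres",    # default database
    user="postgres",      # default superuser
    password="password",  # matches POSTGRES_PASSWORD in the docker run command above
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])
conn.close()
```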
Create Airbyte Components
- Navigate to http://localhost:8000/
- Create an Airbyte source for the GitLab departments file
- Create an Airbyte source for the GitLab jobs file
- Create an Airbyte destination for the local Postgres instance
- Create a connection between the departments source and the destination
- Create a connection between the jobs source and the destination
- Retrieve the connection ID for each connection; it is the UUID that appears after `/connections/` in the connection URL, e.g. `b503b849-189e-47eb-b684-fdbe1221cd4c` in http://localhost:8000/workspaces/53dbc046-08cd-4a4a-b980-370a9c56833e/connections/b503b849-189e-47eb-b684-fdbe1221cd4c/status
- Load the connection IDs into `scrapers/github_scraper.py` in the `sync_gitlab_departments` and `sync_gitlab_jobs` Dagster ops (see the sketch below)
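For orientation, here is one way the connection IDs can be wired into Dagster ops that trigger Airbyte syncs over Airbyte's HTTP API. This is an illustrative sketch only: the real ops live in `scrapers/github_scraper.py`, the placeholder constants and the `trigger_airbyte_sync` helper are made up for this example, and depending on your Airbyte version the API may also require basic auth.

```python
# Illustrative sketch; the real ops are defined in scrapers/github_scraper.py.
# Paste the UUIDs copied from the Airbyte connection URLs over the placeholders.
import requests
from dagster import op

AIRBYTE_API = "http://localhost:8000/api/v1"
DEPARTMENTS_CONNECTION_ID = "<departments-connection-uuid>"  # placeholder
JOBS_CONNECTION_ID = "<jobs-connection-uuid>"                # placeholder


def trigger_airbyte_sync(connection_id: str) -> None:
    """Ask the local Airbyte server to run a sync for one connection."""
    response = requests.post(
        f"{AIRBYTE_API}/connections/sync",
        json={"connectionId": connection_id},
        timeout=30,
        # auth=("airbyte", "password"),  # uncomment if your Airbyte build enables basic auth
    )
    response.raise_for_status()


@op
def sync_gitlab_departments() -> None:
    trigger_airbyte_sync(DEPARTMENTS_CONNECTION_ID)


@op
def sync_gitlab_jobs() -> None:
    trigger_airbyte_sync(JOBS_CONNECTION_ID)
```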
Run the Scraper
- Open a terminal from the `job-scraper` repo
- Launch the Dagster UI (run `dagit` in your terminal)
- Click `Launchpad`
- Click `Launch Run` in the bottom right corner
Helpful Links
This project is basically a mash-up of these two tutorials:
- https://airbyte.com/tutorials/orchestrate-data-ingestion-and-transformation-pipelines
- https://airbyte.com/tutorials/data-scraping-with-airflow-and-beautiful-soup
If you want a database client to view your Postgres tables, DBeaver is one option.
Tech Stack
- Scraping scripts - Python with requests and BeautifulSoup4 (sketched below)
- Extract/Load - Airbyte
- Orchestration - Dagster
- Storage - Postgres
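As a rough illustration of the scraping layer (the URL, CSS selector, and output filename below are placeholders, not the repo's actual code), a scraper in this pattern fetches a careers page with requests, parses it with BeautifulSoup4, and drops a CSV into `/tmp/airbyte_local/gitlab` for Airbyte to load into Postgres:

```python
# Illustrative sketch of the scraping pattern used in this project.
# The real scraper (scrapers/github_scraper.py) differs in URL, parsing
# logic, and output format - everything below is a placeholder.
import csv
from pathlib import Path

import requests
from bs4 import BeautifulSoup

JOBS_URL = "https://example.com/careers"                         # placeholder URL
OUTPUT_FILE = Path("/tmp/airbyte_local/gitlab/gitlab_jobs.csv")  # placeholder filename

response = requests.get(JOBS_URL, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Placeholder selector: collect the text and target of every job posting link.
rows = [
    {"title": link.get_text(strip=True), "url": link["href"]}
    for link in soup.select("a[href*='/jobs/']")
]

OUTPUT_FILE.parent.mkdir(parents=True, exist_ok=True)
with OUTPUT_FILE.open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(rows)

print(f"Wrote {len(rows)} rows to {OUTPUT_FILE}")
```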
Support
Please submit an issue for any support.
Roadmap
Short Term
- Use dagster-dbt to add dbt transformation example to the pipeline
- Persist dagit to preserve run history
- Write scrapers for more job boards
Intermediate
- Create a web app to display jobs
Long Term
- Host this project on the cloud to automate data management and provide a public website
- Create weekly email digest of new jobs
Contributing
Accepting all contributions!
MIT License
Project Status
Active