This project scrapes data from individual company job boards.
Goals
- Learn the components of the Modern Data Stack
- Long term: Create a curated repository of remote jobs to overcome traditional job aggregator challenges, including
  - Duplication of postings
  - Lack of postings
  - Inability to perform fine-grained searching
Supported job boards
- GitLab
Prerequisites
Initial Setup
- Create a local folder to hold the job scraper and Airbyte repos
- Open a terminal window from that folder or `cd` into the folder
- Clone the job scraper repo (run `git clone https://gitlab.com/cameronwtaylor/job-scraper.git` in your terminal)
Create Virtual Environment
- Navigate into the job scraper repo in your terminal
- Run the following commands in your terminal (a quick sanity check follows the list)
  - `python3 -m venv .venv` (creates a virtual environment)
  - `source .venv/bin/activate` (activates the environment)
  - `python3 -m pip install -r requirements.txt` (installs the required dependencies into the active environment)
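Once the environment is active, a quick way to confirm the install worked is to import the core libraries. This is a minimal sketch and assumes `requests`, `beautifulsoup4`, and `dagster` are among the pinned dependencies in `requirements.txt` (which matches the tech stack listed later in this README).

```python
# sanity_check.py - run inside the activated virtual environment.
# Assumption: requests, beautifulsoup4, and dagster are pinned in requirements.txt.
import bs4
import dagster
import requests

# Print resolved versions so a broken install fails loudly here,
# rather than midway through a scraper run.
print("requests", requests.__version__)
print("beautifulsoup4", bs4.__version__)
print("dagster", dagster.__version__)
```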
Install Docker and Airbyte
- Open a terminal window from the local folder you created in Initial Setup or `cd` into the folder
- Follow the official Deploy Airbyte instructions
- Create a folder called `gitlab` inside `/tmp/airbyte_local` (this is a hidden folder in the root directory of your machine; see the sketch after this list)
If this is your first time installing Docker, you may need to open Docker Desktop before running `docker-compose up` to avoid errors.
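The `gitlab` folder can be created by hand with `mkdir -p /tmp/airbyte_local/gitlab`, or programmatically as below. This is a sketch, not repo code; the assumption is that `/tmp/airbyte_local` is the directory Airbyte's docker-compose deployment exposes to connectors as `/local`, and that the scrapers write the departments and jobs files into the `gitlab` subfolder for the Airbyte sources configured later to read.

```python
# Sketch: create the folder the Airbyte local mount expects.
# Assumption: /tmp/airbyte_local is mounted into Airbyte containers as /local,
# and the scrapers write their output files into the gitlab subfolder.
from pathlib import Path

AIRBYTE_LOCAL_ROOT = Path("/tmp/airbyte_local")  # hidden folder at the machine root
GITLAB_DIR = AIRBYTE_LOCAL_ROOT / "gitlab"

GITLAB_DIR.mkdir(parents=True, exist_ok=True)
print(f"Ready: {GITLAB_DIR}")
```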
Install Postgres
- Pull the Postgres Docker image (run `docker pull postgres` in your terminal)
- Run a Postgres instance in a Docker container (run `docker run --name postgres-database -p 127.0.0.1:5432:5432 -e POSTGRES_PASSWORD=password -d postgres` in your terminal); a connectivity check follows the list
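Before wiring Airbyte to this database, it can help to confirm the container is reachable. The check below is a sketch and assumes `psycopg2-binary` is installed (`pip install psycopg2-binary`); it is not necessarily in `requirements.txt`.

```python
# Connectivity check for the dockerized Postgres started above.
# Assumption: psycopg2-binary is installed (pip install psycopg2-binary).
import psycopg2

conn = psycopg2.connect(
    host="127.0.0.1",
    port=5432,
    dbname="postgres",    # default database
    user="postgres",      # default superuser
    password="password",  # matches POSTGRES_PASSWORD in the docker run command above
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])
conn.close()
```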
Create Airbyte Components
- Navigate to http://localhost:8000/
- Create an Airbyte source for the GitLab departments file
- Create an Airbyte source for the GitLab jobs file
- Create an Airbyte destination for the local Postgres instance
- Create a connection between the departments source and the destination
- Create a connection between the jobs source and the destination
- Retrieve the connection ID for each connection; it is the UUID that appears after `/connections/` in the connection URL, e.g. `b503b849-189e-47eb-b684-fdbe1221cd4c` in http://localhost:8000/workspaces/53dbc046-08cd-4a4a-b980-370a9c56833e/connections/b503b849-189e-47eb-b684-fdbe1221cd4c/status
- Load the connection IDs into `scrapers/github_scraper.py` in the `sync_gitlab_departments` and `sync_gitlab_jobs` Dagster ops (see the sketch below)
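For orientation, here is one way the connection IDs can be wired into Dagster ops that trigger Airbyte syncs over Airbyte's HTTP API. This is an illustrative sketch only: the real ops live in `scrapers/github_scraper.py`, the placeholder constants and the `trigger_airbyte_sync` helper are made up for this example, and depending on your Airbyte version the API may also require basic auth.

```python
# Illustrative sketch; the real ops are defined in scrapers/github_scraper.py.
# Paste the UUIDs copied from the Airbyte connection URLs over the placeholders.
import requests
from dagster import op

AIRBYTE_API = "http://localhost:8000/api/v1"
DEPARTMENTS_CONNECTION_ID = "<departments-connection-uuid>"  # placeholder
JOBS_CONNECTION_ID = "<jobs-connection-uuid>"                # placeholder


def trigger_airbyte_sync(connection_id: str) -> None:
    """Ask the local Airbyte server to run a sync for one connection."""
    response = requests.post(
        f"{AIRBYTE_API}/connections/sync",
        json={"connectionId": connection_id},
        timeout=30,
        # auth=("airbyte", "password"),  # uncomment if your Airbyte build enables basic auth
    )
    response.raise_for_status()


@op
def sync_gitlab_departments() -> None:
    trigger_airbyte_sync(DEPARTMENTS_CONNECTION_ID)


@op
def sync_gitlab_jobs() -> None:
    trigger_airbyte_sync(JOBS_CONNECTION_ID)
```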
Run the Scraper
- Open a terminal from the `job-scraper` repo
- Launch the Dagster UI (run `dagit` in your terminal)
- Click `Launchpad`
- Click `Launch Run` in the bottom right corner
Helpful Links
This project is basically a mash-up of these two tutorials:
- https://airbyte.com/tutorials/orchestrate-data-ingestion-and-transformation-pipelines
- https://airbyte.com/tutorials/data-scraping-with-airflow-and-beautiful-soup
If you want a database client to view your Postgres tables, DBeaver is one option.
Tech Stack
- Scraping scripts - Python with requests and BeautifulSoup4 (sketched below)
- Extract/Load - Airbyte
- Orchestration - Dagster
- Storage - Postgres
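As a rough illustration of the scraping layer (the URL, CSS selector, and output filename below are placeholders, not the repo's actual code), a scraper in this pattern fetches a careers page with requests, parses it with BeautifulSoup4, and drops a CSV into `/tmp/airbyte_local/gitlab` for Airbyte to load into Postgres:

```python
# Illustrative sketch of the scraping pattern used in this project.
# The real scraper (scrapers/github_scraper.py) differs in URL, parsing
# logic, and output format - everything below is a placeholder.
import csv
from pathlib import Path

import requests
from bs4 import BeautifulSoup

JOBS_URL = "https://example.com/careers"                         # placeholder URL
OUTPUT_FILE = Path("/tmp/airbyte_local/gitlab/gitlab_jobs.csv")  # placeholder filename

response = requests.get(JOBS_URL, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Placeholder selector: collect the text and target of every job posting link.
rows = [
    {"title": link.get_text(strip=True), "url": link["href"]}
    for link in soup.select("a[href*='/jobs/']")
]

OUTPUT_FILE.parent.mkdir(parents=True, exist_ok=True)
with OUTPUT_FILE.open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(rows)

print(f"Wrote {len(rows)} rows to {OUTPUT_FILE}")
```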
Support
Please submit an issue for any support.
Roadmap
Short Term
- Use dagster-dbt to add dbt transformation example to the pipeline
- Persist dagit to preserve run history
- Write scrapers for more job boards
Intermediate
- Create a web app to display jobs
Long Term
- Host this project on the cloud to automate data management and provide a public website
- Create weekly email digest of new jobs
Contributing
Accepting all contributions!
MIT License
Project Status
Active