cameronwtaylor/job-scraper

Introduction

This project scrapes data from individual company job boards.

Goals

  • Learn the components of the Modern Data Stack
  • Long term: Create a curated repository of remote jobs that overcomes the usual shortcomings of traditional job aggregators, including:
    • Duplicate postings
    • Missing postings
    • No support for fine-grained searching

Supported job boards

  • GitLab

Installation

Prerequisites

You will need git, Python 3, and Docker (with Docker Compose) installed; the commands below assume a Unix-like shell.

Initial Setup

  1. Create a local folder to hold the job scraper and Airbyte repos
  2. Open a terminal window from that folder or cd into the folder
  3. Clone the job scraper repo (run git clone https://gitlab.com/cameronwtaylor/job-scraper.git in your terminal)

Create Virtual Environment

  1. Navigate into the job scraper repo in your terminal
  2. Run the following commands in your terminal
    • python3 -m venv .venv (creates a virtual environment)
    • source .venv/bin/activate (activates the environment)
    • python3 -m pip install -r requirements.txt (installs the required dependencies into the environment)

Install Docker and Airbyte

  1. Open a terminal window from the local folder you created in Initial Setup or cd into the folder
  2. Follow the official Deploy Airbyte instructions
  3. Create a folder called gitlab inside /tmp/airbyte_local (a folder at the root of your machine that may be hidden in your file browser); the scraped files that Airbyte reads will land here (see the sketch below)

If this is your first time installing Docker, you may need to open Docker Desktop before running docker-compose up to avoid errors.
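
For context, here is a minimal sketch of the hand-off that the gitlab folder enables: a scraper writes newline-delimited JSON into /tmp/airbyte_local/gitlab, and the Airbyte file sources configured later read from there. The endpoint URL, file name, and format below are illustrative assumptions, not taken from this repo's code.

```python
# Sketch of the scrape-then-write hand-off to Airbyte's local file sources.
# The jobs endpoint, file name, and JSON-lines format are assumptions; the
# real scraper in this repo may fetch and shape the data differently.
import json
import pathlib

import requests

OUTPUT_DIR = pathlib.Path("/tmp/airbyte_local/gitlab")
JOBS_URL = "https://example.com/gitlab-jobs.json"  # hypothetical endpoint


def write_jsonl(records, filename):
    """Write one JSON object per line so an Airbyte file source can ingest it."""
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    with open(OUTPUT_DIR / filename, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


if __name__ == "__main__":
    jobs = requests.get(JOBS_URL, timeout=30).json()
    write_jsonl(jobs, "jobs.jsonl")
```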

Install Postgres

  1. Pull the official Postgres Docker image (run docker pull postgres in your terminal)
  2. Run a Postgres instance in a Docker container (run docker run --name postgres-database -p 127.0.0.1:5432:5432 -e POSTGRES_PASSWORD=password -d postgres in your terminal)
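
To confirm the container is reachable with the settings above (user postgres, password password, port 5432 on 127.0.0.1), a quick check from Python looks like the snippet below; psycopg2-binary is assumed to be installed and is not necessarily part of this repo's requirements.txt.

```python
# Quick connectivity check against the Postgres container started above.
# Assumes `pip install psycopg2-binary`; adjust credentials if you changed them.
import psycopg2

conn = psycopg2.connect(
    host="127.0.0.1",
    port=5432,
    user="postgres",       # default user in the official image
    password="password",   # matches POSTGRES_PASSWORD in the docker run command
    dbname="postgres",     # default database
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])
conn.close()
```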

Create Airbyte Components

  1. Navigate to http://localhost:8000/
  2. Create an Airbyte source for the GitLab departments file (screenshot: Airbyte GitLab departments file source)
  3. Create an Airbyte source for the GitLab jobs file (screenshot: Airbyte GitLab jobs file source)
  4. Create an Airbyte destination for the local Postgres instance (screenshot: Airbyte local Postgres destination)
  5. Create a connection between the departments source and the destination (screenshot: Airbyte GitLab departments connection)
  6. Create a connection between the jobs source and the destination (screenshot: Airbyte GitLab jobs connection)
  7. Retrieve the connection ID for each connection from the UUID at the end of the connection's URL, e.g. http://localhost:8000/workspaces/53dbc046-08cd-4a4a-b980-370a9c56833e/connections/b503b849-189e-47eb-b684-fdbe1221cd4c/status
  8. Load the connection IDs into scrapers/github_scraper.py in the sync_gitlab_departments and sync_gitlab_jobs Dagster ops
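
As a rough guide, the two ops might end up looking something like the sketch below once the IDs are pasted in. The Airbyte endpoint and payload follow the open-source Airbyte API for triggering a manual sync, but treat the exact URL, port, and placeholder IDs as assumptions to adapt to your deployment.

```python
# Hedged sketch of the sync ops referenced above, with the connection IDs
# filled in as module-level constants. The Airbyte URL and payload are
# assumptions based on the open-source Airbyte API, not taken from this repo.
import requests
from dagster import op

AIRBYTE_SYNC_URL = "http://localhost:8000/api/v1/connections/sync"

# Paste the UUIDs copied from the connection URLs here.
GITLAB_DEPARTMENTS_CONNECTION_ID = "<departments-connection-uuid>"
GITLAB_JOBS_CONNECTION_ID = "<jobs-connection-uuid>"


def trigger_airbyte_sync(connection_id):
    """Ask the local Airbyte instance to run a manual sync for one connection."""
    response = requests.post(AIRBYTE_SYNC_URL, json={"connectionId": connection_id})
    response.raise_for_status()
    return response.json()


@op
def sync_gitlab_departments():
    trigger_airbyte_sync(GITLAB_DEPARTMENTS_CONNECTION_ID)


@op
def sync_gitlab_jobs():
    trigger_airbyte_sync(GITLAB_JOBS_CONNECTION_ID)
```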

Run the Scraper

  1. Open a terminal from the job-scraper repo
  2. Launch the Dagster UI (run dagit in your terminal)
  3. Click Launchpad
  4. Click Launch Run in the bottom right corner
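
If you prefer to skip the UI, a Dagster job can also be executed in-process from Python; the import path and job name below are hypothetical placeholders, since this README does not name the job.

```python
# Run the pipeline without dagit's Launchpad. The module path and job name
# are hypothetical placeholders for whatever job this repo actually defines.
from scrapers.github_scraper import gitlab_pipeline  # hypothetical job name

if __name__ == "__main__":
    result = gitlab_pipeline.execute_in_process()
    print("Run succeeded:", result.success)
```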

Helpful Links

This project is basically a mash-up of these two tutorials:

If you want a database client to view your Postgres tables, DBeaver is one option.

Infrastructure

Python scrapers orchestrated with Dagster, with Airbyte loading the scraped files into a local Postgres instance; Airbyte and Postgres both run in Docker.

Support

Please submit an issue for any support.

Roadmap

Short Term

  • Use dagster-dbt to add dbt transformation example to the pipeline
  • Persist dagit to preserve run history
  • Write scrapers for more job boards

Intermediate

  • Create a web app to display jobs

Long Term

  • Host this project on the cloud to automate data management and provide a public website
  • Create weekly email digest of new jobs

Contributing

Accepting all contributions!

License

MIT License

Project status

Active
