A Python tool for scraping a set of repositories from GitHub to a MongoDB database.
To use the data which has been collected by gha, you do not need to follow this readme and run it yourself, though you may still wish to if you want to collect a small local dataset for testing.
Instead, please see the project wiki page on using the data.
- Docker
- Python 3.7 or greater
- Clone this repository and cd into the cloned directory
- Create and activate a virtual environment
- Install this package (gha) into the virtual environment
git clone https://github.com/Southampton-RSG/github-analysis.git
cd github-analysis
python3 -m venv venv
source venv/bin/activate
pip install .
- Create a GitHub personal access token at https://github.com/settings/tokens
  - No permissions are required
- Populate a .env file from .env.template
- Start MongoDB database containers
  - docker-compose can be installed with pip if necessary
- Start the gha scraper using a repo list file
  - The virtual environment created above must still be active
docker-compose up -d
gha fetch -f tests/data/UKRI_10.txt
The database web console can be accessed at http://localhost:8081/db/github/.
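
If a fetch fails, it can help to confirm that the .env file is fully populated and that the personal access token is accepted by GitHub. The following is a minimal sketch of such a check, not part of gha itself. The function names (parse_env, empty_keys, rate_limit_request, remaining_calls) are hypothetical, and the sketch assumes the .env file uses plain KEY=VALUE lines; the actual variable names are defined in .env.template and are not assumed here.

```python
import json
import urllib.request


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments.

    Assumption: the .env file uses plain KEY=VALUE syntax with optional quotes.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def empty_keys(values):
    """Return the names of variables that are present but have no value."""
    return sorted(key for key, value in values.items() if not value)


def rate_limit_request(token):
    """Build (but do not send) a request to GitHub's rate limit endpoint."""
    return urllib.request.Request(
        "https://api.github.com/rate_limit",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )


def remaining_calls(token):
    """Send the request; an invalid token raises urllib.error.HTTPError (401)."""
    with urllib.request.urlopen(rate_limit_request(token), timeout=10) as resp:
        return json.load(resp)["resources"]["core"]["remaining"]
```

To use it, read the file with pathlib.Path(".env").read_text(), pass the text to parse_env, and check that empty_keys returns an empty list. Then pass the token value to remaining_calls; a valid token returns the number of API calls left in the current rate limit window.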