You’re going to program a legal data scraper and process a sample data file. For example, you could be using Python to turn a PDF of police activities into JSON, or making recurring API calls to pull down files.
- It's legal.
- We can run the scraper by running one script, called `scraper.py` or at least beginning with `scraper-`.
- Populate the readme for your scraper with as much helpful information as you can!
- Include a truncated version of some sample data so we understand what is generated.
- Stick to the format of `USA/$STATE/$COUNTY/$MUNICIPALITY/$RECORD_TYPE`. If there is no specific county or municipality, you can skip those.
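As a rough illustration of the path format and the truncated sample data mentioned above, here is a minimal sketch. The function names `record_path` and `truncated_sample`, the example values, and the choice of JSON output are all assumptions made for illustration; they are not part of the project's library.

```python
import json


def record_path(state, record_type, county=None, municipality=None):
    """Build USA/$STATE/$COUNTY/$MUNICIPALITY/$RECORD_TYPE.

    County and municipality are skipped when they are not provided.
    (Hypothetical helper; naming and casing conventions are assumptions.)
    """
    parts = ["USA", state, county, municipality, record_type]
    return "/".join(part for part in parts if part)


def truncated_sample(records, limit=3):
    """Return the first few records as pretty-printed JSON for a readme."""
    return json.dumps(records[:limit], indent=2)


# Example values below are made up for illustration only.
print(record_path("CA", "arrests", county="alameda", municipality="oakland"))
print(record_path("CA", "arrests"))
```

The first call includes every level; the second skips the county and municipality because none were given.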
Browse our Data Sources and find a source to scrape. If the source you want to scrape isn't there, please add it yourself. This should take under 5 minutes.
- Clone this repository. Don't know how?
- Optionally, `cd` into the `/setup_gui` directory.
- Follow through the GUI.
  - Mac: run the script with `python3 ScraperSetup.py`
  - Windows: run the executable by double clicking it.
  - @Pythonidaer made an excellent walkthrough of the GUI as of the v0.0.1 release.
- Copy the resulting folder into your clone of `PDAP-Scrapers`.
- `/common` folder here!
- `/Base_Scripts` folder here!
Why start from scratch if we have a useful library? Keep in mind that we can always refactor your work later if necessary, so if you're not sure, we still want you to submit!
The most important thing here is that your scraper is grabbing public criminal justice records, and is legal.
Make sure you follow this guideline for creating folders:
`COUNTRY/STATE/COUNTY/DEPARTMENT_TYPE/DEPARTMENT-X-NAME/`, where `DEPARTMENT_TYPE` is one of `CITY`, `COUNTY`, `COLLEGE`, `STATE`, or `FEDERAL`.
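The folder guideline above could be applied with a small helper like the following sketch. It assumes that the department type is its own folder level named with one of the five listed values; `make_scraper_dir` and the example names are hypothetical and only illustrate the layout.

```python
import tempfile
from pathlib import Path

# The five department types listed in the folder guideline.
DEPARTMENT_TYPES = {"CITY", "COUNTY", "COLLEGE", "STATE", "FEDERAL"}


def make_scraper_dir(root, country, state, county, department_type, department_name):
    """Create COUNTRY/STATE/COUNTY/DEPARTMENT_TYPE/DEPARTMENT-X-NAME/ under root.

    Hypothetical helper: rejects department types outside the listed set.
    """
    if department_type not in DEPARTMENT_TYPES:
        raise ValueError(f"Unknown department type: {department_type!r}")
    target = Path(root) / country / state / county / department_type / department_name
    target.mkdir(parents=True, exist_ok=True)
    return target


# Demonstrate in a throwaway directory; the names are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    created = make_scraper_dir(tmp, "USA", "CA", "alameda", "CITY", "oakland_police")
    print(created.relative_to(tmp).as_posix())
```

Validating the department type up front keeps misspelled folder names from ending up in the tree.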
What kind of data are we scraping?
Police data that's already made public by a government jurisdiction.
What languages are allowed?
Python is preferred. If you use another language, we may not be able to easily fold it into our infrastructure.
Are there any specific formatting guidelines I should adhere to?
For now, if you use Python, try to stick with PEP 8 formatting. A good formatter for this is Black.