Scraping, parsing and re-publishing University of Missouri Police Department Incident Reports
The University of Missouri Police Department publishes data about its cases on its website. They're doing a great job keeping the data up to date, but there are a couple of problems:
- The incident page has filter options to find specific kinds of incidents within a date range and/or at a specific address, which is nice. But some of the less common charges, like making a terrorist threat, aren't categorized under an incident type. Furthermore, not all cases originate from an incident report, so you won't even find those cases on this list.
- The daily Clery reports include every case and more information about each one, including the exact charges and the current disposition of the case. But the daily reports are published as PDFs, which makes the data hard to search or analyze.
We can do better. Here's how:
- Download the daily Clery reports;
- Extract the text from the PDF pages;
- Parse that text into a database;
- Build a web app for users to interact with this improved data.
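The "parse that text into a database" step could look roughly like the sketch below. It is only an illustration: the line layout (`case number | charge | disposition`), the table name `cases`, the helper names `parse_cases` and `save_cases`, and the sample text are all hypothetical, since the real layout of the extracted report text isn't described here. Only the standard library (`re` and `sqlite3`) is used.

```python
import re
import sqlite3

# Hypothetical layout: one charge per line, fields separated by " | ".
# The real report text will differ; adjust the pattern to match it.
LINE_PATTERN = re.compile(
    r"^(?P<case_number>\S+) \| (?P<charge>.+?) \| (?P<disposition>.+)$"
)


def parse_cases(text):
    """Yield (case_number, charge, disposition) for each matching line."""
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:
            yield (
                match.group("case_number"),
                match.group("charge"),
                match.group("disposition"),
            )


def save_cases(conn, cases):
    """Store parsed rows; a later report overwrites an older disposition."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cases ("
        "case_number TEXT, charge TEXT, disposition TEXT, "
        "PRIMARY KEY (case_number, charge))"
    )
    conn.executemany("INSERT OR REPLACE INTO cases VALUES (?, ?, ?)", cases)
    conn.commit()


# Made-up sample standing in for the text of one extracted PDF page.
SAMPLE_TEXT = """\
Daily report (sample)
2014-0001 | Stealing | Active
2014-0002 | Making a terrorist threat | Cleared by arrest
"""

conn = sqlite3.connect(":memory:")
save_cases(conn, parse_cases(SAMPLE_TEXT))
rows = conn.execute(
    "SELECT case_number, disposition FROM cases ORDER BY case_number"
).fetchall()
```

Lines that don't match the pattern (like the sample's heading line) are skipped, and keying rows on case number plus charge means re-running the parser on a newer report updates the disposition instead of duplicating the case.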
Here's what you'll need:
- Python 2.7+: An interpreted, object-oriented, high-level programming language;
- requests: For handling HTTP requests;
- html5lib: For parsing HTML the same way any major browser would;
- beautifulsoup4: For conveniently manipulating the parsed HTML.
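As a rough sketch of how these libraries might fit together for the download step: html5lib parses the page, Beautiful Soup finds the links that point at PDFs, and requests fetches each file. The function names `find_pdf_links` and `download_pdf`, the sample HTML, and the `example.edu` address are placeholders, not the department's actual page or URL.

```python
import requests
from bs4 import BeautifulSoup

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2


def find_pdf_links(html, base_url):
    """Return absolute URLs for every link on the page that ends in .pdf."""
    # "html5lib" tells Beautiful Soup to parse the way a browser would,
    # which copes with unclosed tags and other sloppy markup.
    soup = BeautifulSoup(html, "html5lib")
    links = []
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        if href.lower().endswith(".pdf"):
            links.append(urljoin(base_url, href))
    return links


def download_pdf(url, path):
    """Fetch one report and write it to disk; raises on an HTTP error."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    with open(path, "wb") as handle:
        handle.write(response.content)


# Placeholder markup (note the unclosed <p>) and a placeholder base URL.
SAMPLE_HTML = (
    '<p><a href="/reports/2014-01-01.pdf">Jan 1</a> '
    '<a href="/about.html">About</a>'
)
pdf_links = find_pdf_links(SAMPLE_HTML, "http://example.edu/police/")
```

`download_pdf` isn't called here because it needs a real report URL; in practice you'd fetch the listing page with `requests.get`, pass its `.text` to `find_pdf_links`, and loop over the results.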