Python scripts to process IRS 990 XML data
Work in progress.
Background on the project
Read about the 990 data on the IRS's Amazon page: https://aws.amazon.com/public-datasets/irs-990/
In short, the IRS has posted (as of March 1, 2017) about 59 gigabytes of XML files that represent tax-exempt organizations' Form 990 filings.
The filings are inventoried by year in JSON index files, with names like index_2012.json. The filings themselves have names like 201017793492000000_public.xml.
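The naming conventions above can be sketched as a few small helpers. This is only an illustration: the function names and the term "filing identifier" for the numeric prefix are assumptions, not part of the project's scripts.

```python
import re

# Hypothetical helpers; the patterns mirror the example names in this README.
INDEX_NAME = re.compile(r"^index_(\d{4})\.json$")
FILING_NAME = re.compile(r"^(\d+)_public\.xml$")


def index_filename(year):
    """Return the index file name for a year, e.g. index_2012.json."""
    return "index_{}.json".format(int(year))


def filing_filename(filing_id):
    """Return the XML file name for a filing identifier."""
    return "{}_public.xml".format(filing_id)


def parse_index_filename(name):
    """Return the year (int) from an index file name, or None if it does not match."""
    match = INDEX_NAME.match(name)
    return int(match.group(1)) if match else None


def parse_filing_filename(name):
    """Return the identifier (str) from a filing file name, or None if it does not match."""
    match = FILING_NAME.match(name)
    return match.group(1) if match else None
```

The identifier is kept as a string so that it round-trips unchanged between the index and the file name.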
There are seventeen separate schemas used. Check out the archive.org mirror of the data to download just the .xsd schema information; it also has HTML-formatted diffs.
First, download the CSV of Ledger organizations into this directory.
pip install -r requirements.txt
To output a CSV, call OneWayToGetData() in IPython after loading the script with run final_xml_parser.py.
Alternatively, see AnotherWayToGetData() for an example of parsing a single index JSON file as a stream.
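For readers unfamiliar with stream parsing, here is a minimal standard-library sketch of the idea. It is not the code in AnotherWayToGetData(); the function name iter_index_records is hypothetical, and it assumes the index file is a JSON object whose single key maps to an array of filing records (each record a JSON object).

```python
import json


def iter_index_records(fileobj, chunk_size=65536):
    """Yield one record (dict) at a time from the first array in an index file.

    Assumed layout: {"<some key>": [ {...}, {...}, ... ]}.
    Only the unparsed tail of the file is held in memory.
    """
    decoder = json.JSONDecoder()
    buffer = ""
    in_array = False
    while True:
        chunk = fileobj.read(chunk_size)
        buffer += chunk
        if not in_array:
            start = buffer.find("[")
            if start == -1:
                if not chunk:
                    return  # end of file without finding an array
                continue
            buffer = buffer[start + 1:]
            in_array = True
        while True:
            # Skip whitespace and the commas that separate records.
            buffer = buffer.lstrip(" \t\r\n,")
            if not buffer:
                break
            if buffer[0] == "]":
                return  # end of the array
            try:
                record, end = decoder.raw_decode(buffer)
            except ValueError:
                break  # record is incomplete; read another chunk
            yield record
            buffer = buffer[end:]
        if not chunk:
            return  # end of file (possibly truncated input)
```

A caller would open an index file in text mode and loop over iter_index_records(f), handling each record as it arrives instead of loading the whole file with json.load.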
Include tests alongside your modules by adding _test to the test module's name.
Run tests with nosetests.
Get coverage reports for all modules by running:
nosetests --with-coverage --cover-package=`find . -name '*.py' | sed 's/^\.\///' | sed 's/\.py$//' | grep -v '_test$' | paste -s -d, -`
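The pipeline above is dense, so here is a rough Python equivalent of the module list it is meant to produce: every .py file under the current directory, with the extension stripped, test modules left out, joined by commas. The function name coverage_modules is hypothetical, and unlike find it sorts the names.

```python
import os


def coverage_modules(root="."):
    """Return a comma-separated list of non-test modules found under root."""
    names = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith(".py"):
                continue
            stem = filename[:-3]  # drop the ".py" extension
            if stem.endswith("_test"):
                continue  # skip test modules
            # Path relative to root, like the leading "./" removal in the pipeline.
            names.append(os.path.relpath(os.path.join(dirpath, stem), root))
    return ",".join(sorted(names))
```

The result could be passed as the value of --cover-package in place of the backquoted pipeline.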