Train/test data includes schools, hospitals, and other facility types #106

Open
vingkan opened this issue Mar 5, 2019 · 3 comments

vingkan commented Mar 5, 2019

According to the paper, inspections of hospitals and schools should not be included in the model train/test data. However, cross-referencing the model data with food inspection records from the Chicago data portal suggests that the model train/test data includes many different facility types, including hospitals and schools.

  • Is there a discrepancy between how food establishments are classified in the data portal and how they are classified in their business license?

You can reproduce my Jupyter notebook that checks the facility types by launching it in Binder:

Binder

@tomschenkjr pointed out that there are at least two locations in the code that should filter out other types of facilities:

This still leaves 1003 inspections with facility type listed as "Other". After cross-referencing with the data portal, 994 inspections appear to be facilities other than restaurants or grocery stores.
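The cross-referencing step described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name, the `inspection_id` key, and the sample records are all assumptions made up for the example, and only `facility_type` is a field name mentioned in this thread.

```python
from collections import Counter


def facility_type_counts(model_ids, portal_records):
    """Count data-portal facility types for the inspections used by the model."""
    # Index portal records by inspection ID (hypothetical key name).
    by_id = {str(rec["inspection_id"]): rec for rec in portal_records}
    counts = Counter()
    for inspection_id in model_ids:
        rec = by_id.get(str(inspection_id))
        if rec is None:
            # The model uses an inspection that the portal query did not return.
            counts["<not found in portal>"] += 1
            continue
        # Treat a missing or blank facility type as its own category.
        facility = (rec.get("facility_type") or "").strip()
        counts[facility or "<missing facility type>"] += 1
    return counts


# Hypothetical sample data, not real inspection records.
portal = [
    {"inspection_id": "1", "facility_type": "Restaurant"},
    {"inspection_id": "2", "facility_type": "School"},
    {"inspection_id": "3", "facility_type": "Hospital"},
    {"inspection_id": "4"},
]
counts = facility_type_counts(["1", "2", "3", "4", "5"], portal)
for facility, n in counts.most_common():
    print(facility, n)
```

Counting missing and unmatched records separately makes it easier to tell a classification discrepancy apart from a gap in the portal query.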

There also appear to be 11 inspections in the model train/test data that did not have a facility type in their record from the data portal query. Here is an excerpt from my query showing how I filtered the data portal records (SoQL):

--
WHERE inspection_type = "Canvass"
AND inspection_date >= "2011-01-01"
AND inspection_date <= "2014-11-01"
--
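For anyone who wants to rerun the filter outside the notebook, here is a rough sketch of how the clause above could be assembled and URL-encoded in Python. The helper names and the placeholder endpoint are assumptions for illustration; substitute the real resource URL of the food inspections dataset. The `$where` and `$limit` parameters are standard Socrata query parameters, and the quoting simply mirrors the excerpt.

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption): replace with the dataset's real resource URL.
EXAMPLE_BASE_URL = "https://data.cityofchicago.org/resource/DATASET_ID.json"


def build_where(inspection_type, start_date, end_date):
    """Build the WHERE clause from the excerpt, with the same quoting style."""
    return (
        f'inspection_type = "{inspection_type}" '
        f'AND inspection_date >= "{start_date}" '
        f'AND inspection_date <= "{end_date}"'
    )


def build_query_url(base_url, where, limit=50000):
    """Attach the clause as a URL-encoded $where parameter."""
    params = {"$where": where, "$limit": limit}
    return f"{base_url}?{urlencode(params)}"


where = build_where("Canvass", "2011-01-01", "2014-11-01")
url = build_query_url(EXAMPLE_BASE_URL, where)
print(where)
print(url)
```

The sketch only builds the URL and does not send a request, so it can be checked without network access.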
@geneorama
Member

I get an error when I launch the "binder" link above.

> Is there a discrepancy between how food establishments are classified in the data portal and how they are classified in their business license?

We join information about the business license at the time of inspection to the record of the inspection. We then filter the records to retain only "Retail Food Establishment" records.

As you noticed, a lot of business types (like schools and hospitals) are subject to food inspections. It's important to note that businesses have many license types. For example, some have liquor licenses alongside their retail food license, and others do not.

To summarize, we only use inspections whose associated business license description is "Retail Food Establishment".

As for the "other" license types you're noticing, perhaps you're not looking at the licenses at the time of inspection? It could be that they dropped their food-related license(s). For example, maybe it's a book shop that once also served or sold food, but now just sells books.
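To make the "license at the time of inspection" idea concrete, here is a minimal sketch of that kind of filter. It is not the project's code: the field names (`license_number`, `start_date`, `end_date`, `description`), the function names, and the sample records are all hypothetical.

```python
from datetime import date

RETAIL_FOOD = "Retail Food Establishment"


def licenses_at_inspection(inspection, licenses):
    """Return the licenses that were active for the business on the inspection date."""
    return [
        lic for lic in licenses
        if lic["license_number"] == inspection["license_number"]
        and lic["start_date"] <= inspection["inspection_date"] <= lic["end_date"]
    ]


def keep_retail_food(inspections, licenses):
    """Keep inspections whose business held a retail food license at that time."""
    kept = []
    for insp in inspections:
        active = licenses_at_inspection(insp, licenses)
        if any(lic["description"] == RETAIL_FOOD for lic in active):
            kept.append(insp)
    return kept


# Hypothetical book shop that dropped its food license after 2012.
licenses = [
    {"license_number": 7, "description": RETAIL_FOOD,
     "start_date": date(2011, 1, 1), "end_date": date(2012, 12, 31)},
    {"license_number": 7, "description": "Limited Business License",
     "start_date": date(2013, 1, 1), "end_date": date(2014, 12, 31)},
]
inspections = [
    {"inspection_id": 101, "license_number": 7, "inspection_date": date(2012, 6, 1)},
    {"inspection_id": 102, "license_number": 7, "inspection_date": date(2014, 3, 1)},
]
kept_ids = [insp["inspection_id"] for insp in keep_retail_food(inspections, licenses)]
print(kept_ids)
```

In this made-up example the 2012 inspection is kept even though the shop's most recent license is no longer food related, which is one way a later lookup could show a non-food type for an inspection that was legitimately included.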

It's quite possible that you've found something, and I'll take a deeper look when we refactor the code, which should be happening in the next few months. The filtering is a little messy, and I think that this is something which will be fixed in the upcoming edits.

@vingkan
Author

vingkan commented Mar 5, 2019

Hi @geneorama, I have updated the Binder link above. Here it is again:

Binder

It may take a while to load. In case there are still issues, here is a copy of the notebook.

@geneorama
Member

Sorry, had a hard time following the Python and wasn't working on this project. Now that I'm back in it, I think I see what's going on.

We filter the business licenses where LICENSE_DESCRIPTION is "Retail Food Establishment". Then we also use information about the facility_type, which comes from the food inspection data.

My understanding is that these are places that serve prepared food. However, we do a lot of inspections in other places that sell packaged food or have kitchens.

I think that some of these retail food places are selling prepared foods in places like grocery stores. We do model the inspection of that prepared food, but we do not model the inspection of the packaged food, which is a separate license.

As I'm working on 2.0 I want to dig into this and be sure of the assumptions, so I'm glad you asked. The first time we did this I relied very heavily on prior art, but this time I want to understand it a bit more.

Before my talk at UseR! 2016, I performed some analysis to see what kinds of places are being inspected, in order to get a list of all licenses that are inspected. As I recall, it wasn't as simple as I had hoped, and I couldn't find a clear-cut rule for "this is a place that would get inspected". The best regex I found searched for these terms in the license description: "Retail Food|Consumption|Caterer|Food|Child". Then I grouped them together. My final count looked like this:

| LICENSE_DESCRIPTION | N |
| --- | --- |
| Retail Food Establishment | 10910 |
| Incidental Activity | 2139 |
| Wholesale Food Establishment | 545 |
| Caterer | 192 |
| Shared Kitchen | 205 |
| Mobile Food License | 75 |
| Children’s Services Facility License | 817 |
| Special Events | 31 |

This is old data, and I'm not sure how it would hold up with new license designations. Digging into that now.
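For reference, the regex-and-count step mentioned above could look something like this in Python. This is only a sketch: the original analysis was not done with this code, the function name and sample descriptions are made up, and case-sensitive matching is an assumption.

```python
import re
from collections import Counter

# Pattern quoted in this thread; case-sensitive matching is an assumption.
PATTERN = re.compile(r"Retail Food|Consumption|Caterer|Food|Child")


def count_matching(descriptions):
    """Count license descriptions that contain any of the search terms."""
    return Counter(d for d in descriptions if PATTERN.search(d))


# Hypothetical license descriptions, not the real license data.
sample = [
    "Retail Food Establishment",
    "Retail Food Establishment",
    "Caterer",
    "Wholesale Food Establishment",
    "Tobacco",
    "Children's Services Facility License",
]
matched = count_matching(sample)
for description, n in matched.most_common():
    print(description, n)
```

One thing the sketch makes visible: any description that matches "Retail Food" also matches "Food", so the first alternative does not change which rows are selected.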
