Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

FIPS code for New York City and Rhode Island Missing? #105

Closed
datamusing opened this issue Apr 1, 2020 · 10 comments
Closed

FIPS code for New York City and Rhode Island Missing? #105

datamusing opened this issue Apr 1, 2020 · 10 comments

Comments

@datamusing
Copy link

No description provided.

@MichaelKelleher
Copy link

county = 'New York City' is a manual aggregation in this data set (probably because The New York Times reports on metro areas, or perhaps because their data pre-aggregates the five borough/counties into a single 'New York City' data point(s). This means there is no fips code for 'New York City'

I set the fips code in the input file (us-counties.csv) to 0 for New York City and then, when joining other data (wherein fips code is called geo_id) I use :
CASE WHEN geo_id IN ( '36005', '36047', '36061', '36081', '36085' ) THEN '0' ELSE geo_id END geo_id
and aggregate any metrics (population, area_square_meters, etc.) by geo_id

@MichaelKelleher
Copy link

MichaelKelleher commented Apr 1, 2020

Thanks @datamusing for pointing out Rhode Island. I missed that:
select CASE
WHEN state = 'Rhode Island' AND county = 'Unknown' THEN 'Rhode Island'
WHEN state = 'Missouri' AND county IN ('Kansas City','Cass','Clay','Jackson','Platte') THEN 'Kansas City'
ELSE county END county,
CASE
WHEN county = 'New York City' THEN 0
WHEN state = 'Missouri' AND county in ('Kansas City','Cass','Clay','Jackson','Platte') THEN 2
WHEN state = 'Rhode Island' THEN 1
ELSE fips
END fips,
state,
date,
sum(cases) cases,
sum(deaths) deaths,
FROM cv.nyt_counties
GROUP BY 1,2,3,4) nc
LEFT OUTER JOIN
(select
CASE
WHEN geo_id IN ( '36005', '36047', '36061', '36081', '36085' ) THEN '0'
WHEN geo_id IN ( '44001', '44003', '44005', '44007', '44009' ) THEN '1'
WHEN geo_id IN ( '29037', '29047', '29095','29165' ) THEN '2'
ELSE geo_id
END geo_id,
sum(nonfamily_households + family_households ) households,
sum(total_pop) total_pop,
...
FROM bigquery-public-data.census_bureau_acs.county_2017_5yr
GROUP BY 1) pop
ON cast(pop.geo_id as int64) = cast(nc.fips as int64)

@datamusing
Copy link
Author

datamusing commented Apr 1, 2020 via email

@albertsun
Copy link
Collaborator

For Rhode Island, only the Unknown county will have a null FIPS code. See these rows below for 3/31.

2020-03-30,Bristol,Rhode Island,44001,14,0
2020-03-30,Kent,Rhode Island,44003,32,0
2020-03-30,Newport,Rhode Island,44005,21,0
2020-03-30,Providence,Rhode Island,44007,214,0
2020-03-30,Unknown,Rhode Island,,106,4
2020-03-30,Washington,Rhode Island,44009,28,0

@MichaelKelleher
Copy link

Thanks @albertsun. Also, awesome work. I'm just making adjustments (reaggregating at the level of a set of counties) so I can join on other fips datasets without having missing values. I lose some granularity, but I prefer that trade off to having cases / deaths that cannot be joined against population, demographcs, aggregated land area, etc.

@vicatcu
Copy link

vicatcu commented Apr 2, 2020

@MichaelKelleher et al given that there are FIPS codes pertinent to New York City... and given that NYC has official data at this level (see https://www1.nyc.gov/site/doh/covid/covid-19-data.page) why is New York City being aggregated into a single data point per day?

@MichaelKelleher
Copy link

@albertsun might know the answer here, but perhaps I can save him the effort with a little speculation (I'm guessing that, since he's at the NYT, he might have his hands full.)
Many other datasets (JHU, et al.) started collecting and making covid19 data available without much expectation of the interest in that data. Further, they provided no Service Level Agreement for those resources. Naturally, when JHU changed their schema, got embroiled in a political outrage at thier classifications of countries, realized they couldn't provide data at the original granularity, etc., users were, to put it mildly, annoyed. "You've broken thousands of models!"
Certainly, disseminators of resources can do a better or worse, more professional or less, job in managing data resources, but they also must make decisions, and in the case of covid19 data, I imagine with a grave sense of urgency.
The New York Times took on the daunting project of collecting, verifying, and (I didn't think they were savvy enough for it) making county level data available on github. git pull origin master and I have the latest data, for free, on my local instance. That is simply amazing to me.
The decision (conscious or not) to break the fips data model by manually inserting 'bogus' counties with no fips code, to me, falls under the category of petty annoyance. Annoying because it requires 8 lines of code and a fips look up. Petty because I'm a professional data analyst and I basically have to do similar things at least once a week with data created by engineers that, although they too have a sense of urgency, have more time to think through the ramifications of their small set of data consumers.
New York City, across it's five counties, is under tremendous resource and human strain and perhaps our heroes at the New York Times felt it was not worth the use of any resource to disambiguate cases into the appropriate fips-coded county, but simply to dump them all in the bucket of New York City. Perhaps their data source cannot verify at the fips level and they prefer to be accurate rather than precise. I don't care. I'm grateful for the data.
For the most part, in my own analysis, to date at least, the fips level of data can only be used for "vanity" metrics. I cannot infer much about the future, or see much about the to-date efficacy of shelter-in-place orders by shelter-tenure, for example, on a county level, so I'm aggregating anyway. I do, however, now get to choose the way in which I aggregate, urban vs. not, parts of states, metro areas, etc.
Why Rhode Island? Why Kansas City? For god's sake, I would love to know what is special about Joplin County. They did, at least, annotate in REAMDE the decisions they made. Hopefully someday we'll get to buy these data savages a beverage, thank them in person, and hear the stories behind all of these. For now, does anyone (not at the New York Times!) have the time, availability, and inclination to create a county level (fips, FTW!) shelter_effective_date .csv and make a pull request?
I hope this didn't come off as ranty, I'm truly grateful for all the great work, and heartened by the number of people that have interest in and ability with such data sets.

@datamusing
Copy link
Author

datamusing commented Apr 2, 2020 via email

@vicatcu
Copy link

vicatcu commented Apr 2, 2020

@MichaelKelleher yea I empathize, and I wasn't really complaining.

@MichaelKelleher
Copy link

I didn't assume it was a complaint, and I admitted to "petty annoyance" myself! Really just wanted to express some of my thoughts about what I imagine is going on in the data trenches and how I'm reacting to it. I'm on my third pipeline, have written 2,000 lines of code, built 20 predictive models 20 times each, developed new metrics (my latest favorite is three_days_in_a_row_death_growth_rate_decreasing for US states), and first thing I do in the morning is pull al the repos, build the data, and check my visualizations. I know more about world geography (especially population density) than ever!
All the best @vicatcu !

(You folks know of any more appropriate place to have these conversations?)

@tiffehr tiffehr closed this as completed Apr 4, 2023
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants