Skip to content

json_normalize does not normalize subrecords properly if any subrecords values are NoneType #20030

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Closed
aerymilts opened this issue Mar 7, 2018 · 3 comments · Fixed by #20399
Closed
Labels
Bug IO JSON read_json, to_json, json_normalize Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Milestone

Comments

@aerymilts
Copy link
Contributor

Code Sample, a copy-pastable example if possible

data_fail_to_normalize = \
        [{'info': None}, 
         
        {'info': 
         {'created_at': '11/08/1993', 'last_updated': '26/05/2012'},
        'author_name': 
         {'first': 'Jane', 'last_name': 'Doe'}
        }]

data_partial_fail = \
        [{'info': None, 
         'author_name': 
         {'first': 'Smith', 'last_name': 'Appleseed'}
        }, 
        
        {'info': 
         {'created_at': '11/08/1993', 'last_updated': '26/05/2012'},
        'author_name': 
         {'first': 'Jane', 'last_name': 'Doe'}
        }]

>>> import pandas as pd
>>> pd.io.json.json_normalize(data_fail_to_normalize)

Output 1

author_name info
0 nan None
1 {'first': 'Jane', 'last_name': 'Doe'} {'created_at': '11/08/1993', 'last_updated': '26/05/2012'}
>>> pd.io.json.json_normalize(data_partial_fail)

Output 2

author_name.first author_name.last_name info info.created_at info.last_updated
0 Smith Appleseed nan nan nan
1 Jane Doe nan 11/08/1993 26/05/2012

Problem description

I expected that the json_normalize function takes into account the presence of NoneTypes in the dictionaries. This leads to 2 separate issues (If I should open this as 2 separate issues, let me know).

I have already written a fix that solves this issue - if anyone else can validate that this is not working as intended, I can set up a PR.

Output 1

Does not unnest json after encountering NoneType at first instance of subrecord, line 192 of pandas/io/json/normalize.py

Output 2

Keeps the None value when encountered, [{k: {'alpha': 'foo', 'beta': 'bar'}}, {k: None}], see nested_to_record function. Creates additional column of nans which would not otherwise occur if that particular key was removed.

Expected Output

Output 1

author_name.first author_name.last_name info.created_at info.last_updated
0 nan nan nan nan
1 Jane Doe 11/08/1993 26/05/2012

Output 2

author_name.first author_name.last_name info.created_at info.last_updated
0 Smith Appleseed nan nan
1 Jane Doe 11/08/1993 26/05/2012

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-104-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.4.0
Cython: None
numpy: 1.13.1
scipy: None
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@gfyoung gfyoung added the IO JSON read_json, to_json, json_normalize label Mar 8, 2018
@gfyoung
Copy link
Member

gfyoung commented Mar 8, 2018

I would personally prefer the second option. Feel free to post the PR!

@gfyoung gfyoung added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate IO JSON read_json, to_json, json_normalize and removed IO JSON read_json, to_json, json_normalize labels Mar 8, 2018
@aerymilts
Copy link
Contributor Author

@gfyoung Were you referring to opening this as two separate issues?

@gfyoung
Copy link
Member

gfyoung commented Mar 18, 2018

@aerymilts : I was referring to the second output being preferable to the first.

aerymilts added a commit to aerymilts/pandas that referenced this issue Mar 18, 2018
…0030)

TST: additional coverage for the test cases from (pandas-dev#20030)

DOC: added changes to whatsnew/v0.23.0.txt (pandas-dev#20030)
@jreback jreback added this to the 0.23.0 milestone Mar 20, 2018
nehiljain added a commit to nehiljain/pandas that referenced this issue Mar 21, 2018
…ame_describe

* upstream/master: (158 commits)
  Add link to "Craft Minimal Bug Report" blogpost (pandas-dev#20431)
  BUG: fixed json_normalize for subrecords with NoneTypes (pandas-dev#20030) (pandas-dev#20399)
  BUG: ExtensionArray.fillna for scalar values (pandas-dev#20412)
  DOC" update the Pandas core window rolling count docstring" (pandas-dev#20264)
  DOC: update the pandas.DataFrame.plot.hist docstring (pandas-dev#20155)
  DOC: Only use ~ in class links to hide prefixes. (pandas-dev#20402)
  Bug: Allow np.timedelta64 objects to index TimedeltaIndex (pandas-dev#20408)
  DOC: add disallowing of Series construction of len-1 list with index to whatsnew (pandas-dev#20392)
  MAINT: Remove weird pd file
  DOC: update the Index.isin docstring (pandas-dev#20249)
  BUG: Handle all-NA blocks in concat (pandas-dev#20382)
  DOC: update the pandas.core.resample.Resampler.fillna docstring (pandas-dev#20379)
  BUG: Don't raise exceptions splitting a blank string (pandas-dev#20067)
  DOC: update the pandas.DataFrame.cummax docstring (pandas-dev#20336)
  DOC: update the pandas.core.window.x.mean docstring (pandas-dev#20265)
  DOC: update the api.types.is_number docstring (pandas-dev#20196)
  Fix linter (pandas-dev#20389)
  DOC: Improved the docstring of pandas.Series.dt.to_pytimedelta (pandas-dev#20142)
  DOC: update the pandas.Series.dt.is_month_end docstring (pandas-dev#20181)
  DOC: update the window.Rolling.min docstring (pandas-dev#20263)
  ...
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
Bug IO JSON read_json, to_json, json_normalize Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants