json_normalize does not normalize subrecords properly if any subrecords values are NoneType #20030

aerymilts · 2018-03-07T06:31:27Z

Code Sample, a copy-pastable example if possible

data_fail_to_normalize = [
    {'info': None},
    {
        'info': {'created_at': '11/08/1993', 'last_updated': '26/05/2012'},
        'author_name': {'first': 'Jane', 'last_name': 'Doe'},
    },
]

data_partial_fail = [
    {
        'info': None,
        'author_name': {'first': 'Smith', 'last_name': 'Appleseed'},
    },
    {
        'info': {'created_at': '11/08/1993', 'last_updated': '26/05/2012'},
        'author_name': {'first': 'Jane', 'last_name': 'Doe'},
    },
]

>>> import pandas as pd
>>> pd.io.json.json_normalize(data_fail_to_normalize)

Output 1

	author_name	info
0	nan	None
1	{'first': 'Jane', 'last_name': 'Doe'}	{'created_at': '11/08/1993', 'last_updated': '26/05/2012'}

>>> pd.io.json.json_normalize(data_partial_fail)

Output 2

	author_name.first	author_name.last_name	info	info.created_at	info.last_updated
0	Smith	Appleseed	nan	nan	nan
1	Jane	Doe	nan	11/08/1993	26/05/2012

Problem description

I expected json_normalize to handle None values that appear in place of subrecord dictionaries. Instead, a None subrecord causes two separate problems, shown as Output 1 and Output 2 (if I should open these as two separate issues, let me know).

I have already written a fix for this. If someone can confirm that the current behavior is unintended, I can set up a PR.

Output 1

json_normalize does not unnest any subrecords when the first record it encounters has None in place of a subrecord; see line 192 of pandas/io/json/normalize.py.
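To illustrate the Output 1 case, here is a minimal pure-Python sketch. It assumes, based only on the description above, that the decision to flatten is made by looking at the first record alone. Both helper names are hypothetical and this is not the pandas source.

```python
# Hypothetical helpers for illustration only; not taken from pandas.

def first_record_has_subrecord(records):
    """Inspect only the first record (the behavior the report describes)."""
    return any(isinstance(value, dict) for value in records[0].values())


def any_record_has_subrecord(records):
    """Inspect every record, so a leading None cannot hide later dicts."""
    return any(
        isinstance(value, dict)
        for record in records
        for value in record.values()
    )


data_fail_to_normalize = [
    {"info": None},
    {
        "info": {"created_at": "11/08/1993", "last_updated": "26/05/2012"},
        "author_name": {"first": "Jane", "last_name": "Doe"},
    },
]

print(first_record_has_subrecord(data_fail_to_normalize))  # False
print(any_record_has_subrecord(data_fail_to_normalize))    # True
```

Under that assumption, a check over all records would still trigger flattening even when the first record holds None.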

Output 2

When a key maps to a dict in one record and to None in another, e.g. [{k: {'alpha': 'foo', 'beta': 'bar'}}, {k: None}], the None value is kept under the bare key (see the nested_to_record function). This creates an additional column of nans that would not appear if that key were removed from the record.
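The extra column in Output 2 can be reproduced without pandas. The sketch below uses a hypothetical `flatten` helper, a simplified stand-in and not the real nested_to_record, to show how keeping a None value under the bare key adds an `info` column next to `info.created_at` and `info.last_updated`. The `drop_none` flag is an assumption made for illustration.

```python
# Simplified stand-in for illustration; not the pandas implementation.

def flatten(record, prefix="", sep=".", drop_none=False):
    """Flatten nested dicts into dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep, drop_none))
        elif value is None and drop_none:
            continue  # omit the key so no bare parent column is created
        else:
            flat[name] = value
    return flat


def columns(records, **kwargs):
    """Union of flattened keys, i.e. the columns a table of records would get."""
    return sorted({key for record in records for key in flatten(record, **kwargs)})


data_partial_fail = [
    {"info": None, "author_name": {"first": "Smith", "last_name": "Appleseed"}},
    {
        "info": {"created_at": "11/08/1993", "last_updated": "26/05/2012"},
        "author_name": {"first": "Jane", "last_name": "Doe"},
    },
]

print(columns(data_partial_fail))
# ['author_name.first', 'author_name.last_name', 'info',
#  'info.created_at', 'info.last_updated']
print(columns(data_partial_fail, drop_none=True))
# ['author_name.first', 'author_name.last_name',
#  'info.created_at', 'info.last_updated']
```

Dropping every None is a blunt rule: it would also discard scalar fields that are legitimately None, which appears to be the concern of the later issue #21356 referenced in this thread.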

Expected Output

Output 1

	author_name.first	author_name.last_name	info.created_at	info.last_updated
0	nan	nan	nan	nan
1	Jane	Doe	11/08/1993	26/05/2012

Output 2

	author_name.first	author_name.last_name	info.created_at	info.last_updated
0	Smith	Appleseed	nan	nan
1	Jane	Doe	11/08/1993	26/05/2012

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-104-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.4.0
Cython: None
numpy: 1.13.1
scipy: None
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

gfyoung · 2018-03-08T17:53:46Z

I would personally prefer the second option. Feel free to post the PR!

aerymilts · 2018-03-18T07:17:45Z

@gfyoung Were you referring to opening this as two separate issues?

gfyoung · 2018-03-18T07:41:37Z

@aerymilts : I was referring to the second output being preferable to the first.

gfyoung added the IO JSON read_json, to_json, json_normalize label Mar 8, 2018

gfyoung added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate IO JSON read_json, to_json, json_normalize and removed IO JSON read_json, to_json, json_normalize labels Mar 8, 2018

aerymilts mentioned this issue Mar 18, 2018

BUG: fixed json_normalize for subrecords with NoneTypes (#20030) #20399

Merged

4 tasks

jreback added this to the 0.23.0 milestone Mar 20, 2018

aerymilts added a commit to aerymilts/pandas that referenced this issue Mar 20, 2018

TST: updated test purposes and how nan is declared (pandas-dev#20030)

76d2d91

jreback closed this as completed in #20399 Mar 20, 2018

jreback pushed a commit that referenced this issue Mar 20, 2018

BUG: fixed json_normalize for subrecords with NoneTypes (#20030) (#20399)

01882ba

dworvos pushed a commit to dworvos/pandas that referenced this issue Apr 2, 2018

BUG: fixed json_normalize for subrecords with NoneTypes (pandas-dev#20030) (pandas-dev#20399)

086c68d

This was referenced May 24, 2018

json_normalize gives KeyError in 0.23 #21158

Closed

BUG: Fix nested_to_record with None values in nested levels #21164

Merged

WillAyd mentioned this issue Jun 7, 2018

JSON nested_to_record Silently Drops Top-Level None Values #21356

Closed

wujiayikelly mentioned this issue Jul 9, 2018

json_normalize should supply empty columns if record_path are not present #21830

Closed
