Skip to content

BUG: json_normalize generates TypeError: 'NoneType' object is not subscriptable as metadata object is not always present. #37783

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Closed
sann05 opened this issue Nov 12, 2020 · 5 comments
Labels
Bug IO JSON read_json, to_json, json_normalize

Comments

@sann05
Copy link

sann05 commented Nov 12, 2020

The example below always generate an error TypeError: 'NoneType' object is not subscriptable as metadata object is not always present.

file = [{'values': ['1', '2'], 'metadata': {'name': 'first_value'}},
        {'values': ['3', '4'], 'metadata': None}]

df = json_normalize(file,
                    record_path='values',
                    meta=[['metadata', 'name']],
                    errors='ignore')

The metadata object is dynamic and nested, so I am concerned why it doesn't generate something like this:

0 metadata.name
1 first_value
2 first_value
3 nan
4 nan

as I set errors='ignore'

@WillAyd Could you please explain the best way to handle this?

Originally posted by @sann05 in #21830 (comment)

@sann05 sann05 changed the title json_normalize generates TypeError: 'NoneType' object is not subscriptable as metadata object is not always present. BUG: json_normalize generates TypeError: 'NoneType' object is not subscriptable as metadata object is not always present. Nov 12, 2020
@justinkraus
Copy link

did you find a workaround to this? Getting a similar error

@sann05
Copy link
Author

sann05 commented Jan 13, 2021

did you find a workaround to this? Getting a similar error

I copied json_normalize and nested_to_record function and reworked it as I found 1 more issue #37782

PS. Updated parts:

  1. Catching TypeError (except TypeError as e:)
  2. Changed np.array to pd.Series().values (it handles arrays field)
# Function was copied from pandas
def nested_to_record(
        ds,
        prefix: str = "",
        sep: str = ".",
        level: int = 0,
        max_level: Optional[int] = None,
):
    """
    A simplified json_normalize

    Converts a nested dict into a flat dict ("record"), unlike json_normalize,
    it does not attempt to extract a subset of the data.

    Parameters
    ----------
    ds : dict or list of dicts
    prefix: the prefix, optional, default: ""
    sep : str, default '.'
        Nested records will generate names separated by sep,
        e.g., for sep='.', { 'foo' : { 'bar' : 0 } } -> foo.bar
    level: int, optional, default: 0
        The number of levels in the json string.

    max_level: int, optional, default: None
        The max depth to normalize.

        .. versionadded:: 0.25.0

    Returns
    -------
    d - dict or list of dicts, matching `ds`

    Examples
    --------
    IN[52]: nested_to_record(dict(flat1=1,dict1=dict(c=1,d=2),
                                  nested=dict(e=dict(c=1,d=2),d=2)))
    Out[52]:
    {'dict1.c': 1,
     'dict1.d': 2,
     'flat1': 1,
     'nested.d': 2,
     'nested.e.c': 1,
     'nested.e.d': 2}
    """
    singleton = False
    if isinstance(ds, dict):
        ds = [ds]
        singleton = True
    new_ds = []
    for d in ds:
        new_d = copy.deepcopy(d)
        for k, v in d.items():
            # each key gets renamed with prefix
            if not isinstance(k, str):
                k = str(k)
            if level == 0:
                newkey = k
            else:
                newkey = prefix + sep + k

            # flatten if type is dict and
            # current dict level  < maximum level provided and
            # only dicts gets recurse-flattened
            # only at level>1 do we rename the rest of the keys
            if not isinstance(v, dict) or (
                    max_level is not None and level >= max_level
            ):
                if level != 0:  # so we skip copying for top level, common case
                    v = new_d.pop(k)
                    new_d[newkey] = v
                continue
            else:
                v = new_d.pop(k)
                new_d.update(nested_to_record(v, newkey, sep, level + 1, max_level))
        new_ds.append(new_d)

    if singleton:
        return new_ds[0]
    return new_ds


def json_normalize(
        data: Union[Dict, List[Dict]],
        record_path: Optional[Union[str, List]] = None,
        meta: Optional[Union[str, List[Union[str, List[str]]]]] = None,
        meta_prefix: Optional[str] = None,
        record_prefix: Optional[str] = None,
        errors: str = "raise",
        sep: str = ".",
        max_level: Optional[int] = None,
) -> "DataFrame":
    """
    Normalize semi-structured JSON data into a flat table.

    Parameters
    ----------
    data : dict or list of dicts
        Unserialized JSON objects.
    record_path : str or list of str, default None
        Path in each object to list of records. If not passed, data will be
        assumed to be an array of records.
    meta : list of paths (str or list of str), default None
        Fields to use as metadata for each record in resulting table.
    meta_prefix : str, default None
        If True, prefix records with dotted (?) path, e.g. foo.bar.field if
        meta is ['foo', 'bar'].
    record_prefix : str, default None
        If True, prefix records with dotted (?) path, e.g. foo.bar.field if
        path to records is ['foo', 'bar'].
    errors : {'raise', 'ignore'}, default 'raise'
        Configures error handling.

        * 'ignore' : will ignore KeyError if keys listed in meta are not
          always present.
        * 'raise' : will raise KeyError if keys listed in meta are not
          always present.
    sep : str, default '.'
        Nested records will generate names separated by sep.
        e.g., for sep='.', {'foo': {'bar': 0}} -> foo.bar.
    max_level : int, default None
        Max number of levels(depth of dict) to normalize.
        if None, normalizes all levels.

        .. versionadded:: 0.25.0

    Returns
    -------
    frame : DataFrame
    Normalize semi-structured JSON data into a flat table.

    Examples
    --------
    >>> data = [{'id': 1, 'name': {'first': 'Coleen', 'last': 'Volk'}},
    ...         {'name': {'given': 'Mose', 'family': 'Regner'}},
    ...         {'id': 2, 'name': 'Faye Raker'}]
    >>> pandas.json_normalize(data)
        id        name name.family name.first name.given name.last
    0  1.0         NaN         NaN     Coleen        NaN      Volk
    1  NaN         NaN      Regner        NaN       Mose       NaN
    2  2.0  Faye Raker         NaN        NaN        NaN       NaN

    >>> data = [{'id': 1,
    ...          'name': "Cole Volk",
    ...          'fitness': {'height': 130, 'weight': 60}},
    ...         {'name': "Mose Reg",
    ...          'fitness': {'height': 130, 'weight': 60}},
    ...         {'id': 2, 'name': 'Faye Raker',
    ...          'fitness': {'height': 130, 'weight': 60}}]
    >>> json_normalize(data, max_level=0)
                fitness                 id        name
    0   {'height': 130, 'weight': 60}  1.0   Cole Volk
    1   {'height': 130, 'weight': 60}  NaN    Mose Reg
    2   {'height': 130, 'weight': 60}  2.0  Faye Raker

    Normalizes nested data up to level 1.

    >>> data = [{'id': 1,
    ...          'name': "Cole Volk",
    ...          'fitness': {'height': 130, 'weight': 60}},
    ...         {'name': "Mose Reg",
    ...          'fitness': {'height': 130, 'weight': 60}},
    ...         {'id': 2, 'name': 'Faye Raker',
    ...          'fitness': {'height': 130, 'weight': 60}}]
    >>> json_normalize(data, max_level=1)
      fitness.height  fitness.weight   id    name
    0   130              60          1.0    Cole Volk
    1   130              60          NaN    Mose Reg
    2   130              60          2.0    Faye Raker

    >>> data = [{'state': 'Florida',
    ...          'shortname': 'FL',
    ...          'info': {'governor': 'Rick Scott'},
    ...          'counties': [{'name': 'Dade', 'population': 12345},
    ...                       {'name': 'Broward', 'population': 40000},
    ...                       {'name': 'Palm Beach', 'population': 60000}]},
    ...         {'state': 'Ohio',
    ...          'shortname': 'OH',
    ...          'info': {'governor': 'John Kasich'},
    ...          'counties': [{'name': 'Summit', 'population': 1234},
    ...                       {'name': 'Cuyahoga', 'population': 1337}]}]
    >>> result = json_normalize(data, 'counties', ['state', 'shortname',
    ...                                            ['info', 'governor']])
    >>> result
             name  population    state shortname info.governor
    0        Dade       12345   Florida    FL    Rick Scott
    1     Broward       40000   Florida    FL    Rick Scott
    2  Palm Beach       60000   Florida    FL    Rick Scott
    3      Summit        1234   Ohio       OH    John Kasich
    4    Cuyahoga        1337   Ohio       OH    John Kasich

    >>> data = {'A': [1, 2]}
    >>> json_normalize(data, 'A', record_prefix='Prefix.')
        Prefix.0
    0          1
    1          2

    Returns normalized data with columns prefixed with the given string.
    """

    def _pull_field(
            js: Dict[str, Any], spec: Union[List, str]
    ) -> Union[Scalar, Iterable]:
        """Internal function to pull field"""
        result = js
        if isinstance(spec, list):
            for field in spec:
                result = result[field]
        else:
            result = result[spec]
        return result

    def _pull_records(js: Dict[str, Any], spec: Union[List, str]) -> List:
        """
        Internal function to pull field for records, and similar to
        _pull_field, but require to return list. And will raise error
        if has non iterable value.
        """
        result = _pull_field(js, spec)

        # GH 31507 GH 30145, GH 26284 if result is not list, raise TypeError if not
        # null, otherwise return an empty list
        if not isinstance(result, list):
            if pd.isnull(result):
                result = []
            else:
                raise TypeError(
                    f"{js} has non list value {result} for path {spec}. "
                    "Must be list or null."
                )
        return result

    if isinstance(data, list) and not data:
        return DataFrame()

    # A bit of a hackjob
    if isinstance(data, dict):
        data = [data]

    if record_path is None:
        if any([isinstance(x, dict) for x in y.values()] for y in data):
            # naive normalization, this is idempotent for flat records
            # and potentially will inflate the data considerably for
            # deeply nested structures:
            #  {VeryLong: { b: 1,c:2}} -> {VeryLong.b:1 ,VeryLong.c:@}
            #
            # TODO: handle record value which are lists, at least error
            #       reasonably
            data = nested_to_record(data, sep=sep, max_level=max_level)
        return DataFrame(data)
    elif not isinstance(record_path, list):
        record_path = [record_path]

    if meta is None:
        meta = []
    elif not isinstance(meta, list):
        meta = [meta]

    _meta = [m if isinstance(m, list) else [m] for m in meta]

    # Disastrously inefficient for now
    records: List = []
    lengths = []

    meta_vals: DefaultDict = defaultdict(list)
    meta_keys = [sep.join(val) for val in _meta]

    def _recursive_extract(data, path, seen_meta, level=0):
        if isinstance(data, dict):
            data = [data]
        if len(path) > 1:
            for obj in data:
                for val, key in zip(_meta, meta_keys):
                    if level + 1 == len(val):
                        seen_meta[key] = _pull_field(obj, val[-1])

                _recursive_extract(obj[path[0]], path[1:], seen_meta,
                                   level=level + 1)
        else:
            for obj in data:
                recs = _pull_records(obj, path[0])
                recs = [
                    nested_to_record(r, sep=sep, max_level=max_level)
                    if isinstance(r, dict)
                    else r
                    for r in recs
                ]

                # For repeating the metadata later
                lengths.append(len(recs))
                for val, key in zip(_meta, meta_keys):
                    if level + 1 > len(val):
                        meta_val = seen_meta[key]
                    else:
                        try:
                            meta_val = _pull_field(obj, val[level:])
                        except KeyError as e:
                            if errors == "ignore":
                                meta_val = np.nan
                            else:
                                raise KeyError(
                                    "Try running with errors='ignore' as key "
                                    f"{e} is not always present"
                                ) from e
                        except TypeError as e:
                            if errors == "ignore":
                                meta_val = np.nan
                            else:
                                raise TypeError(
                                    f"{e}. "
                                    "Try running with errors='ignore' as meta "
                                    f"path {val[level:]} is not always present"
                                ) from e
                    meta_vals[key].append(meta_val)
                records.extend(recs)

    _recursive_extract(data, record_path, {}, level=0)

    result = DataFrame(records)

    if record_prefix is not None:
        result = result.rename(columns=lambda x: f"{record_prefix}{x}")

    # Data types, a problem
    for k, v in meta_vals.items():
        if meta_prefix is not None:
            k = meta_prefix + k

        if k in result:
            raise ValueError(
                f"Conflicting metadata name {k}, need distinguishing prefix "
            )
        result[k] = pd.Series(v, dtype=object).repeat(lengths).values
    return result

@hongzed
Copy link

hongzed commented Feb 16, 2021

encountering the exactly same issue here... thinking the errors="ignore" may handle it but unfortunately it didn't
thinking to capture the TypeError and coming up with a different meta definition in my specific case

@jbrockmendel jbrockmendel added Bug IO JSON read_json, to_json, json_normalize labels Mar 24, 2021
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Jun 3, 2022
@simonjayhawkins
Copy link
Member

Thanks @sann05 for the report.

This is now fixed. (from pandas 1.4) in commit: [917d04f] BUG: closes #44312: fixes unwanted TypeError when a missing metadata field is missing (#44325)

#44312 was a duplicate of this issue.

tests added in #44325 so can just close this as fixed

@jkhas8
Copy link

jkhas8 commented Jun 7, 2022

How about the case when using record_path with None:

file = [{'values': None, 'metadata': {'name': 'first_value'}},
        {'values': ['3', '4'], 'metadata': None}]

df = json_normalize(file,
                    record_path='values',
                    meta=[['metadata', 'name']],
                    errors='ignore')

I got error: TypeError: 'NoneType' object is not iterable. How can I skip this?

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
Bug IO JSON read_json, to_json, json_normalize
Projects
None yet
Development

No branches or pull requests

6 participants