Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Fix bug when loading few columns of a dataset with many primary indices #446

Merged

Conversation

mlondschien
Copy link
Contributor

Description:

On master:

In [1]: import pandas as pd
   ...: import storefact
   ...: from functools import partial
   ...: from kartothek.io.eager import store_dataframes_as_dataset, read_table
   ...: 
   ...: path = "/tmp"
   ...: uuid = "dataset"
   ...: 
   ...: df = pd.DataFrame({"w": [-1, -2], "x": [1, 2], "y": ["a", "b"], "z": [0.0, 1.1]})
   ...: 
   ...: store = partial(storefact.get_store_from_url, f"hfs://{path}?create_if_missing=False")
   ...: dm = store_dataframes_as_dataset(
   ...:     dfs=[df],
   ...:     dataset_uuid=uuid,
   ...:     store=store,
   ...:     partition_on=["w", "x", "y"],
   ...:     overwrite=True,
   ...: )
   ...: 
   ...: df = read_table(dataset_uuid=uuid, store=store, columns=["y", "z"])
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-1-a5c31589801c> in <module>
     18 )
     19 
---> 20 df = read_table(dataset_uuid=uuid, store=store, columns=["y", "z"])

~/code/kartothek/kartothek/io/eager.py in read_table(dataset_uuid, store, columns, predicate_pushdown_to_io, categoricals, dates_as_object, predicates, factory)
    248         dataset_uuid=dataset_uuid, store=store, factory=factory,
    249     )
--> 250     partitions = read_dataset_as_dataframes(
    251         columns=columns,
    252         predicate_pushdown_to_io=predicate_pushdown_to_io,

~/code/kartothek/kartothek/io/eager.py in read_dataset_as_dataframes(dataset_uuid, store, columns, predicate_pushdown_to_io, categoricals, dates_as_object, predicates, factory, dispatch_by)
    135     )
    136 
--> 137     mps = read_dataset_as_metapartitions(
    138         columns=columns,
    139         predicate_pushdown_to_io=predicate_pushdown_to_io,

~/code/kartothek/kartothek/io/eager.py in read_dataset_as_metapartitions(dataset_uuid, store, columns, predicate_pushdown_to_io, categoricals, dates_as_object, predicates, factory, dispatch_by)
    202         dispatch_by=dispatch_by,
    203     )
--> 204     return list(ds_iter)
    205 
    206 

~/code/kartothek/kartothek/io/iter.py in read_dataset_as_metapartitions__iterator(dataset_uuid, store, columns, predicate_pushdown_to_io, categoricals, dates_as_object, predicates, factory, dispatch_by)
     87         else:
     88             mp = cast(MetaPartition, mp)
---> 89             mp = mp.load_dataframes(
     90                 store=store,
     91                 columns=columns,

~/code/kartothek/kartothek/io_components/metapartition.py in _impl(self, *method_args, **method_kwargs)
    148         else:
    149             for mp in self:
--> 150                 method_return = method(mp, *method_args, **method_kwargs)
    151                 if not isinstance(method_return, MetaPartition):
    152                     raise ValueError(

~/code/kartothek/kartothek/io_components/metapartition.py in load_dataframes(self, store, columns, predicate_pushdown_to_io, categoricals, dates_as_object, predicates)
    701         # Metadata version >=4 parse the index columns and add them back to the dataframe
    702 
--> 703         df = self._reconstruct_index_columns(
    704             df=df,
    705             key_indices=indices,

~/code/kartothek/kartothek/io_components/metapartition.py in _reconstruct_index_columns(self, df, key_indices, columns, categories, date_as_object)
    800                 if convert_to_date:
    801                     value = pd.Timestamp(value).to_pydatetime().date()
--> 802             df.insert(pos, primary_key, value)
    803 
    804         return df

~/anaconda3/envs/kartothek/lib/python3.9/site-packages/pandas/core/frame.py in insert(self, loc, column, value, allow_duplicates)
   3622         self._ensure_valid_index(value)
   3623         value = self._sanitize_column(column, value, broadcast=False)
-> 3624         self._mgr.insert(loc, column, value, allow_duplicates=allow_duplicates)
   3625 
   3626     def assign(self, **kwargs) -> "DataFrame":

~/anaconda3/envs/kartothek/lib/python3.9/site-packages/pandas/core/internals/managers.py in insert(self, loc, item, value, allow_duplicates)
   1204             self._blknos = np.append(self._blknos, len(self.blocks))
   1205         else:
-> 1206             self._blklocs = np.insert(self._blklocs, loc, 0)
   1207             self._blknos = np.insert(self._blknos, loc, len(self.blocks))
   1208 

<__array_function__ internals> in insert(*args, **kwargs)

~/anaconda3/envs/kartothek/lib/python3.9/site-packages/numpy/lib/function_base.py in insert(arr, obj, values, axis)
   4556         index = indices.item()
   4557         if index < -N or index > N:
-> 4558             raise IndexError(
   4559                 "index %i is out of bounds for axis %i with "
   4560                 "size %i" % (obj, axis, N))

IndexError: index 2 is out of bounds for axis 0 with size 1

In [2]: 

This PR fixes this.

  • Changelog entry

Copy link
Collaborator

@fjetter fjetter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please provide a test for this? From the code changes alone, I would argue that the behaviour is identical

@mlondschien
Copy link
Contributor Author

The above code snipped could serve as a test. There are no tests for _reconstruct_index_columns, but I could add some.

@fjetter
Copy link
Collaborator

fjetter commented Apr 5, 2021

I'm just confused since I assume that the enumerate and the manual increase of the pos counter should be identical so I am not sure what this change is doing.

@mlondschien
Copy link
Contributor Author

We don't increase pos if the condition in L772 holds true, as in the above example. One could also always insert at pos=0, since IIUC the column order gets shuffled later anyways.

@mlondschien
Copy link
Contributor Author

@fjetter could you have another look at this?

@fjetter
Copy link
Collaborator

fjetter commented Apr 14, 2021

Sorry for the delay. Can you please rebase and ping when green?

@mlondschien mlondschien requested a review from fjetter April 14, 2021 12:37
@mlondschien
Copy link
Contributor Author

Test failures are also on master.

@fjetter fjetter merged commit 90ee486 into JDASoftwareGroup:master Apr 14, 2021
@mlondschien mlondschien deleted the only-insert-relevant-columns branch April 14, 2021 14:07
ilia-zaitcev-by added a commit to ilia-zaitcev-by/kartothek that referenced this pull request May 26, 2021
Revert "Bump codecov/codecov-action from v1.4.1 to v1.5.0 (JDASoftwareGroup#466)"
This reverts commit fdc9779.

Revert "fix mistakes in documentation"
This reverts commit 4e4b5e0.

Revert "Bump pre-commit/action from v2.0.0 to v2.0.3 (JDASoftwareGroup#460)"
This reverts commit d027ca2.

Revert "Bump codecov/codecov-action from v1.4.0 to v1.4.1 (JDASoftwareGroup#461)"
This reverts commit 97cd553.

Revert "Bump codecov/codecov-action from v1.3.1 to v1.4.0 (JDASoftwareGroup#458)"
This reverts commit e48d67a.

Revert "Fix bug when loading few columns of a dataset with many primary indices (JDASoftwareGroup#446)"
This reverts commit 90ee486.

Revert "Prepare release 4.0.1"
This reverts commit b278503.

Revert "Fix tests for dask dataframe and delayed backends"
This reverts commit 5520f74.

Revert "Add end-to-end regression test"
This reverts commit 8a3e6ae.

Revert "Fix dataset corruption after updates (JDASoftwareGroup#445)"
This reverts commit a26e840.

Revert "Set release date for 4.0"
This reverts commit 08a8094.

Revert "Return dask scalar for store and update from ddf (JDASoftwareGroup#437)"
This reverts commit 494732d.

Revert "Add tests for non-default table (JDASoftwareGroup#440)"
This reverts commit 3807a02.

Revert "Bump codecov/codecov-action from v1.2.2 to v1.3.1 (JDASoftwareGroup#441)"
This reverts commit f7615ec.

Revert "Set default for dates_as_object to True (JDASoftwareGroup#436)"
This reverts commit 75ffdb5.

Revert "Remove inferred indices (JDASoftwareGroup#438)"
This reverts commit b1e2535.

Revert "fix typo: 'KTK_CUBE_UUID_SEPERATOR' -> 'KTK_CUBE_UUID_SEPARATOR' (JDASoftwareGroup#422)"
This reverts commit b349cee.

Revert "Remove all deprecated arguments (JDASoftwareGroup#434)"
This reverts commit 74f0790.

Revert "Remove multi table feature (JDASoftwareGroup#431)"
This reverts commit 032856a.
ilia-zaitcev-by added a commit to ilia-zaitcev-by/kartothek that referenced this pull request Jun 11, 2021
Revert "Bump codecov/codecov-action from v1.4.1 to v1.5.0 (JDASoftwareGroup#466)"
This reverts commit fdc9779.

Revert "fix mistakes in documentation"
This reverts commit 4e4b5e0.

Revert "Bump pre-commit/action from v2.0.0 to v2.0.3 (JDASoftwareGroup#460)"
This reverts commit d027ca2.

Revert "Bump codecov/codecov-action from v1.4.0 to v1.4.1 (JDASoftwareGroup#461)"
This reverts commit 97cd553.

Revert "Bump codecov/codecov-action from v1.3.1 to v1.4.0 (JDASoftwareGroup#458)"
This reverts commit e48d67a.

Revert "Fix bug when loading few columns of a dataset with many primary indices (JDASoftwareGroup#446)"
This reverts commit 90ee486.

Revert "Prepare release 4.0.1"
This reverts commit b278503.

Revert "Fix tests for dask dataframe and delayed backends"
This reverts commit 5520f74.

Revert "Add end-to-end regression test"
This reverts commit 8a3e6ae.

Revert "Fix dataset corruption after updates (JDASoftwareGroup#445)"
This reverts commit a26e840.

Revert "Set release date for 4.0"
This reverts commit 08a8094.

Revert "Return dask scalar for store and update from ddf (JDASoftwareGroup#437)"
This reverts commit 494732d.

Revert "Add tests for non-default table (JDASoftwareGroup#440)"
This reverts commit 3807a02.

Revert "Bump codecov/codecov-action from v1.2.2 to v1.3.1 (JDASoftwareGroup#441)"
This reverts commit f7615ec.

Revert "Set default for dates_as_object to True (JDASoftwareGroup#436)"
This reverts commit 75ffdb5.

Revert "Remove inferred indices (JDASoftwareGroup#438)"
This reverts commit b1e2535.

Revert "fix typo: 'KTK_CUBE_UUID_SEPERATOR' -> 'KTK_CUBE_UUID_SEPARATOR' (JDASoftwareGroup#422)"
This reverts commit b349cee.

Revert "Remove all deprecated arguments (JDASoftwareGroup#434)"
This reverts commit 74f0790.

Revert "Remove multi table feature (JDASoftwareGroup#431)"
This reverts commit 032856a.
steffen-schroeder-by pushed a commit that referenced this pull request Jun 11, 2021
Revert "Bump codecov/codecov-action from v1.4.1 to v1.5.0 (#466)"
This reverts commit fdc9779.

Revert "fix mistakes in documentation"
This reverts commit 4e4b5e0.

Revert "Bump pre-commit/action from v2.0.0 to v2.0.3 (#460)"
This reverts commit d027ca2.

Revert "Bump codecov/codecov-action from v1.4.0 to v1.4.1 (#461)"
This reverts commit 97cd553.

Revert "Bump codecov/codecov-action from v1.3.1 to v1.4.0 (#458)"
This reverts commit e48d67a.

Revert "Fix bug when loading few columns of a dataset with many primary indices (#446)"
This reverts commit 90ee486.

Revert "Prepare release 4.0.1"
This reverts commit b278503.

Revert "Fix tests for dask dataframe and delayed backends"
This reverts commit 5520f74.

Revert "Add end-to-end regression test"
This reverts commit 8a3e6ae.

Revert "Fix dataset corruption after updates (#445)"
This reverts commit a26e840.

Revert "Set release date for 4.0"
This reverts commit 08a8094.

Revert "Return dask scalar for store and update from ddf (#437)"
This reverts commit 494732d.

Revert "Add tests for non-default table (#440)"
This reverts commit 3807a02.

Revert "Bump codecov/codecov-action from v1.2.2 to v1.3.1 (#441)"
This reverts commit f7615ec.

Revert "Set default for dates_as_object to True (#436)"
This reverts commit 75ffdb5.

Revert "Remove inferred indices (#438)"
This reverts commit b1e2535.

Revert "fix typo: 'KTK_CUBE_UUID_SEPERATOR' -> 'KTK_CUBE_UUID_SEPARATOR' (#422)"
This reverts commit b349cee.

Revert "Remove all deprecated arguments (#434)"
This reverts commit 74f0790.

Revert "Remove multi table feature (#431)"
This reverts commit 032856a.
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants