[C++][Parquet] Raise an error when reading Parquet data with invalid repetition levels #45185

Open
adamreeve opened this issue Jan 7, 2025 · 0 comments


Describe the bug, including details regarding any error messages, version, and platform.

When looking into #45073 I found that Arrow doesn't raise an error when reading data with invalid repetition levels into Arrow list arrays.

The encryption test files included an int64 list column whose leaf values equal i * 1,000,000,000,000, where i is the leaf-value index. The repetition level was set to 1 for even leaf indices and 0 for odd indices, so the first repetition level was 1. That is invalid, because the first value in a column starts a new record and must therefore have repetition level 0. PyArrow nevertheless reads this file without raising any error, and silently skips the first leaf value (0):

pyarrow.Table
int64_field: list<int64_field: int64 not null> not null
  child 0, int64_field: int64 not null
----
int64_field: [[[1000000000000,2000000000000],[3000000000000,4000000000000],...,[97000000000000,98000000000000],[99000000000000]]]

I wouldn't expect an error to be raised when reading the raw values and repetition levels with the lower-level Parquet C++ API, but I think reading this data as an Arrow list should raise an error.
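
To make the problem concrete, here is a minimal pure-Python sketch of list assembly from repetition levels. It is not Arrow's actual implementation: `assemble_lists` and its `strict` flag are hypothetical names used only for illustration, it handles just a single-level list of non-null values (maximum repetition level 1), and it assumes 100 leaf values, which is consistent with the output above. With `strict=False` it drops values that appear before the first repetition level of 0, which reproduces the observed result. With `strict=True` it raises, which is the behaviour proposed here.

```python
def assemble_lists(values, rep_levels, strict=True):
    """Group leaf values into lists using repetition levels (max level 1).

    Hypothetical helper for illustration only, not part of Arrow or PyArrow.
    """
    if len(values) != len(rep_levels):
        raise ValueError("values and rep_levels must have the same length")

    rows = []
    for value, rep in zip(values, rep_levels):
        if rep == 0:
            # Repetition level 0 starts a new top-level list (a new row).
            rows.append([value])
        elif rep == 1:
            if not rows:
                # No row has been started yet, so there is nothing to continue.
                if strict:
                    raise ValueError(
                        "invalid repetition levels: first level must be 0"
                    )
                continue  # lenient mode: silently drop the value
            rows[-1].append(value)
        else:
            raise ValueError(f"unexpected repetition level {rep}")
    return rows


# Data shaped like the test file described above (100 values is an assumption).
values = [i * 1_000_000_000_000 for i in range(100)]
rep_levels = [1 if i % 2 == 0 else 0 for i in range(100)]

lenient = assemble_lists(values, rep_levels, strict=False)
print(lenient[0], lenient[-1])
```

In lenient mode the leading value 0 is lost and the last row holds a single element, matching the table printed by PyArrow. In strict mode the same input fails immediately on the first repetition level.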

Component(s)

C++, Parquet
