ParquetExec::statistics() does not read statistics for many column types (like timestamps, strings, etc) #8295
Comments
Note that the pruning predicate code does correctly read the statistics for other types such as strings and timestamps, because it uses a different code path
I plan to fix this
Could I pick this ticket up?
In |
I think there is some subtlety related to decimals as well -- the best thing to do is probably to study what the existing code in row_groups does -- I think it is here https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L57
At some point there were multiple code paths to extract statistics in parquet (one for file level and one for row group level) that should likely be combined |
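To illustrate how two extraction paths can disagree, here is a minimal sketch in plain Rust. Everything in it is hypothetical: `RawStats`, `LogicalType`, `Scalar`, `min_max_narrow` and `min_max_full` are simplified stand-ins invented for this example, not types or functions from DataFusion or the parquet crate. It only shows the general shape of the problem: a conversion that matches integers alone silently returns no min/max for strings and timestamps, while a more complete one does not.

```rust
/// Hypothetical, simplified stand-in for per-column Parquet statistics.
#[derive(Debug, Clone, PartialEq)]
enum RawStats {
    Int64 { min: i64, max: i64 },
    ByteArray { min: Vec<u8>, max: Vec<u8> },
}

/// Hypothetical logical type of the column in the table schema.
#[derive(Debug, Clone, Copy, PartialEq)]
enum LogicalType {
    Int64,
    Utf8,
    TimestampMicros,
}

/// Hypothetical typed min/max value.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Int64(i64),
    Utf8(String),
    TimestampMicros(i64),
}

/// Narrow path (assumed for illustration): handles plain integers only,
/// so every other column type ends up with no min/max.
fn min_max_narrow(stats: &RawStats, ty: LogicalType) -> Option<(Scalar, Scalar)> {
    match (stats, ty) {
        (RawStats::Int64 { min, max }, LogicalType::Int64) => {
            Some((Scalar::Int64(*min), Scalar::Int64(*max)))
        }
        _ => None,
    }
}

/// Fuller path (assumed for illustration): also converts timestamps stored
/// as 64-bit integers and strings stored as byte arrays.
fn min_max_full(stats: &RawStats, ty: LogicalType) -> Option<(Scalar, Scalar)> {
    match (stats, ty) {
        (RawStats::Int64 { min, max }, LogicalType::Int64) => {
            Some((Scalar::Int64(*min), Scalar::Int64(*max)))
        }
        (RawStats::Int64 { min, max }, LogicalType::TimestampMicros) => {
            Some((Scalar::TimestampMicros(*min), Scalar::TimestampMicros(*max)))
        }
        (RawStats::ByteArray { min, max }, LogicalType::Utf8) => {
            // Invalid UTF-8 yields no statistics rather than a wrong value.
            let min = String::from_utf8(min.clone()).ok()?;
            let max = String::from_utf8(max.clone()).ok()?;
            Some((Scalar::Utf8(min), Scalar::Utf8(max)))
        }
        _ => None,
    }
}
```

With these stand-ins, both functions agree on an integer column, but only `min_max_full` returns a value for a string column. Routing every caller through a single conversion function is one way to keep file-level and row-group-level statistics from drifting apart.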
I believe we have fixed this with #10453 -- statistics are now correctly extracted
Describe the bug
While working on #8229 I found another bug that is non-obvious, but that can be clearly seen now thanks to #8110 and #8111 from @NGA-TRAN
To Reproduce
And then look at the explain verbose output; you can see there are no min/max statistics shown:
Expected behavior
I expect there to be min/max values extracted in the statistics for the strings, as there are for integers (Col[0]: Min=Exact(Int64(1)) Max=Exact(Int64(3))).
Additional context
No response