feat: improve string statistics display in datafusion-cli parquet_metadata function #8535

Merged: 1 commit, Dec 14, 2023

asimsedhain

Which issue does this PR close?

Closes #8464

Rationale for this change

What changes are included in this PR?

Output for the data_index_bloom_encoding_stats.parquet file:

DataFusion: [Screenshot 2023-12-13 at 10 00 09 PM]
DuckDB: [Screenshot 2023-12-13 at 10 00 36 PM]

Are these changes tested?

Yes

Are there any user-facing changes?

Note

One thing I noticed while testing: for the parquet-testing/data/hadoop_lz4_compressed.parquet file, the statistics are still displayed as a byte array.
[Screenshot 2023-12-13 at 10 17 50 PM]

I checked, and the converted type is None for that column, so I'm not sure that blindly converting the byte array into a UTF-8 string would be the right approach. Open to suggestions.
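
To make the question concrete, here is a minimal sketch of the conservative option: decode the bytes as UTF-8 only when the column is annotated as a string, and otherwise keep showing the raw bytes. This is not the code in this PR. `Annotation` and `display_byte_array_stat` are hypothetical names, and `Annotation` is only a stand-in for whatever type annotation the Parquet metadata exposes for the column.

```rust
/// Hypothetical stand-in for a Parquet column's type annotation.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Annotation {
    /// The column is annotated as a UTF-8 string.
    Utf8,
    /// No annotation: the bytes may or may not be text.
    None,
}

/// Render a BYTE_ARRAY min/max statistic for display (illustrative helper).
fn display_byte_array_stat(bytes: &[u8], annotation: Annotation) -> String {
    match annotation {
        // Only attempt a string conversion when the schema says it is a string.
        Annotation::Utf8 => match std::str::from_utf8(bytes) {
            Ok(s) => s.to_string(),
            // Annotated as a string but not valid UTF-8: fall back to raw bytes.
            Err(_) => format!("{:?}", bytes),
        },
        // Unannotated columns keep the raw byte display.
        Annotation::None => format!("{:?}", bytes),
    }
}
```

With this approach an unannotated column such as the one in hadoop_lz4_compressed.parquet would keep its byte-array display. The alternative, attempting the UTF-8 conversion for every byte array and falling back on failure, would read better for files like this one but could mislabel binary data that happens to be valid UTF-8.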

@alamb changed the title from "feat: improve string statistics display" to "feat: improve string statistics display in datafusion-cli parquet_metadata function" on Dec 14, 2023

@alamb left a comment

Thank you for the contribution @asimsedhain -- this looks great. I kicked off the CI run, and if it passes I plan to merge this PR.

I tried it out locally and it looks pretty sweet:

select row_group_id, row_group_num_rows, path_in_schema, type, stats_min, stats_max, stats_null_count from parquet_metadata('/Users/andrewlamb/Software/arrow-datafusion/parquet-testing/data/alltypes_tiny_pages.parquet');
+--------------+--------------------+-------------------+------------+-----------+-------------------+------------------+
| row_group_id | row_group_num_rows | path_in_schema    | type       | stats_min | stats_max         | stats_null_count |
+--------------+--------------------+-------------------+------------+-----------+-------------------+------------------+
| 0            | 7300               | "id"              | INT32      | 0         | 7299              | 0                |
| 0            | 7300               | "bool_col"        | BOOLEAN    | false     | true              | 0                |
| 0            | 7300               | "tinyint_col"     | INT32      | 0         | 9                 | 0                |
| 0            | 7300               | "smallint_col"    | INT32      | 0         | 9                 | 0                |
| 0            | 7300               | "int_col"         | INT32      | 0         | 9                 | 0                |
| 0            | 7300               | "bigint_col"      | INT64      | 0         | 90                | 0                |
| 0            | 7300               | "float_col"       | FLOAT      | 0         | 9.9               | 0                |
| 0            | 7300               | "double_col"      | DOUBLE     | 0         | 90.89999999999999 | 0                |
| 0            | 7300               | "date_string_col" | BYTE_ARRAY | 01/01/09  | 12/31/10          | 0                |
| 0            | 7300               | "string_col"      | BYTE_ARRAY | 0         | 9                 | 0                |
| 0            | 7300               | "timestamp_col"   | INT96      |           |                   | 0                |
| 0            | 7300               | "year"            | INT32      | 2009      | 2010              | 0                |
| 0            | 7300               | "month"           | INT32      | 1         | 12                | 0                |
+--------------+--------------------+-------------------+------------+-----------+-------------------+------------------+
13 rows in set. Query took 0.012 seconds.

cc @Veeupup

@alamb merged commit 1042095 into apache:main on Dec 14, 2023

alamb commented Dec 14, 2023

Thanks again @asimsedhain


Veeupup commented Dec 18, 2023

Thanks @asimsedhain! Good job!

Development

Successfully merging this pull request may close these issues.

Improve string statistics display in datafusion-cli parquet_metadata