Skip to content

Consider adding BloomFilter reading support to ParquetMetadataReader #6514

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Open
alamb opened this issue Oct 6, 2024 · 3 comments
Open

Consider adding BloomFilter reading support to ParquetMetadataReader #6514

alamb opened this issue Oct 6, 2024 · 3 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate

Comments

@alamb
Copy link
Contributor

alamb commented Oct 6, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Parquet now has the wonderful ParquetMetaDataReader structure from @adriangb and @etseidl

This handles reading the footer metadata as well as the page indexes.

@progval noted in #6505 (comment) that BloomFilters are similiar to the PageIndex, but are not currently read/written by the ParquetMetaDataReader

Describe the solution you'd like
I would like to be able to configure the ParquetMetaDataReader (and writer) to read BloomFilters as well

Describe alternatives you've considered
This might look something like

// read parquet metadata including page indexes
let file = open_parquet_file("some_path.parquet");
let mut reader = ParquetMetaDataReader::new()
    .with_bloom_filters(true);
reader.try_parse(&file).unwrap();
let metadata = reader.finish().unwrap();
// Somehow get access to the bloom filters (not sure what that API would look like)

Additional context

@alamb alamb added parquet Changes to the parquet crate enhancement Any new improvement worthy of a entry in the changelog labels Oct 6, 2024
@etseidl
Copy link
Contributor

etseidl commented Oct 7, 2024

I can see two paths forward here. In the near term we could add convenience functions to ParquetMetaDataReader/Writer to allow fetching/writing the bloom filters given an existing ParquetMetaData.

Longer term, I think we'd want to move the bloom filters into the ParquetMetaData struct to enable @alamb's example above. This could be part of the larger refactoring of the metadata (#6129, #6097).

@alamb
Copy link
Contributor Author

alamb commented Oct 8, 2024

I can see two paths forward here. In the near term we could add convenience functions to ParquetMetaDataReader/Writer to allow fetching/writing the bloom filters given an existing ParquetMetaData.

Longer term, I think we'd want to move the bloom filters into the ParquetMetaData struct to enable @alamb's example above. This could be part of the larger refactoring of the metadata (#6129, #6097).

I agree with this breakdown -- I think the first (move the code to read/write bloom filters into ParquetMetaDataReader / ParquetMetaDataWriter) would likely be a step towards the second (and would consolidate the code in a single location), so if someone is interested in trying it that would be neat.

As for a major metadata revamp, that is a bit more disruptive in my mind as it would cause significant downstream churn. 🤔

@progval
Copy link
Contributor

progval commented Oct 9, 2024

@progval noted in #6505 (comment) that BloomFilters are similiar to the PageIndex, but are not currently read/written by the ParquetMetaDataReader

Not that similar though, because Bloom Filters are considerably larger and won't fit in RAM for a large dataset, so they shouldn't be cached (at least not by default)

@alamb alamb changed the title Consider adding BloomFilter reading support to Consider adding BloomFilter reading support to ParquetMetadataReader Nov 5, 2024
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate
Projects
None yet
Development

No branches or pull requests

3 participants