feat: Support Parquet writer options #1123

Open
wants to merge 2 commits into
base: main

@nuno-faria commented May 5, 2025

Which issue does this PR close?

N/A.

Rationale for this change

Supporting all Parquet writer options gives us more flexibility when creating data directly from datafusion-python.

For consistency, this PR supports all writer options defined by ParquetOptions in datafusion, using the same defaults: https://github.com/apache/datafusion/blob/555fc2e24dd669e44ac23a9a1d8406f4ac58a9ed/datafusion/common/src/config.rs#L423.

What changes are included in this PR?

  • Extended write_parquet with all writer options, including column-specific options.
  • Added relevant tests. Since pyarrow does not expose page-level information, some options, such as enabling bloom filters, could not be tested directly (an external tool confirmed that this option works). For this specific case, there is a test that compares file sizes.

Are there any user-facing changes?

The main difference relates to the existing compression field, which now takes a str, as in datafusion, instead of a custom enum. The main advantage is that future algorithms will not require updating the Python-side code.

Additionally, the default compression was changed from zstd(4) to zstd(3), the same as datafusion.
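
To illustrate the string form of the compression setting, here is a minimal sketch of how a value such as `zstd(3)` could be split into a codec name and an optional level. `parse_compression` is a hypothetical helper written for this example; it is not datafusion's parser, and the exact grammar datafusion accepts is an assumption here.

```python
import re
from typing import Optional, Tuple

# Hypothetical helper for illustration only; not datafusion's actual parser.
# Assumed grammar: a codec name, optionally followed by "(<integer level>)".
_SPEC = re.compile(r"^\s*([A-Za-z0-9_]+)\s*(?:\(\s*(\d+)\s*\))?\s*$")


def parse_compression(spec: str) -> Tuple[str, Optional[int]]:
    """Split a spec such as 'zstd(3)' into (codec, level)."""
    match = _SPEC.match(spec)
    if match is None:
        raise ValueError(f"invalid compression spec: {spec!r}")
    codec, level = match.groups()
    # The level is None when the spec carries no parenthesized number.
    return codec.lower(), (int(level) if level is not None else None)
```

Because the codec is carried as plain text, a new algorithm name would pass through code like this unchanged, which is the advantage described above.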

Contributor

@timsaucer left a comment

Overall, I really like the idea. Right now, this includes a breaking change to a very popular user-facing function. I think if we adopt the suggestion to allow two function signatures, we'll be able to include this in the next release.
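
One possible reading of "two function signatures" is to keep accepting the old enum-style argument alongside the new string. The sketch below is only an assumption about how that could look: `LegacyCompression` and `normalize_compression` are hypothetical names standing in for the existing enum and a compatibility shim, not the code in this PR.

```python
from enum import Enum
from typing import Optional, Union


class LegacyCompression(Enum):
    """Hypothetical stand-in for the existing custom compression enum."""

    UNCOMPRESSED = "uncompressed"
    SNAPPY = "snappy"
    ZSTD = "zstd"


def normalize_compression(
    compression: Union[str, LegacyCompression],
    level: Optional[int] = None,
) -> str:
    """Map either calling style to the string form, e.g. 'zstd(3)'."""
    if isinstance(compression, LegacyCompression):
        # Old style: enum member plus an optional separate level.
        name = compression.value
        return f"{name}({level})" if level is not None else name
    if level is not None:
        # New style carries the level inside the string itself.
        raise ValueError("pass the level inside the string, e.g. 'zstd(3)'")
    return compression
```

With a shim like this, existing callers passing the enum would keep working while new callers use the string form directly.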


2 participants