Skip to content

Regression: DataFrameWriteOptions::with_single_file_output produces a directory #13323

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Open
sergiimk opened this issue Nov 9, 2024 · 6 comments
Assignees
Labels
bug Something isn't working good first issue Good for newcomers regression Something that used to work no longer does

Comments

@sergiimk
Copy link
Contributor

sergiimk commented Nov 9, 2024

Describe the bug

Consider a snippet like this:

df.write_parquet(
  "dir/data",
  DataFrameWriteOptions::new().with_single_file_output(true),
  None
).await

Before v43 this would write a single file called data, but in v43 this is creating data as a directory with a randomly named file(s) in it.

This seems to be related to #13079 (cc @dhegberg) that added an extension-based heuristic.

I see this as a regression, as single file output is requested explicitly, and I don't want a heuristics to be applied.

We are using Parquet files with a content-addressable file system and our files don't have extensions.

To Reproduce

See above

Expected behavior

Considering the introduction of the extension-based heuristic I would suggest the following behavior:

  • with_single_file_output is not called (single_file_output == None) - apply the heuristic
  • with_single_file_output(true) - produce a single file at the exact path specified
  • with_single_file_output(false) - create directory under specified path if doesn't exist and write one or many files

Additional context

@sergiimk sergiimk added the bug Something isn't working label Nov 9, 2024
@alamb alamb added regression Something that used to work no longer does good first issue Good for newcomers labels Nov 13, 2024
@alamb
Copy link
Contributor

alamb commented Nov 13, 2024

I agree this is a regression. Thank you for the callout @sergiimk

I think this is a pretty good first issue for someone as the description is clear and the need is well defined.

@irenjj
Copy link
Contributor

irenjj commented Nov 13, 2024

take

@irenjj
Copy link
Contributor

irenjj commented Nov 18, 2024

It seems hard to control the behavior of write_parquet by single_file_output(and I've noticed that It's never used), what really controls whether to generate a single file output is determining the suffix(in start_demuxer_task()), there are several methods I can think of to handle this issue:

  1. We can add a suffix like .single to the paths that require generating a single file, and then recognize this suffix in start_demuxer_task().
  2. Give up single_file_output in DataFrameWriteOptions, use FileSinkConfig instead to control single file behavior.

cc @alamb @sergiimk @dhegberg

@sergiimk
Copy link
Contributor Author

sergiimk commented Nov 19, 2024

Did some digging and found this old PR #9041 (cc @yyy1000) that seems to have removed single_file_output flag from FileSinkConfig - worth looking into it to understand the reasoning, not to undo the changes.

Looking at v42 code it does indeed seem that DataFrameWriteOptions::single_file_output is not read anywhere and the logic relies on just this condition in demux:

let single_file_output = !base_output_path.is_collection();

which in v43 became:

let single_file_output = !base_output_path.is_collection() && base_output_path.file_extension().is_some();

The DataFrameWriteOptions::new().with_single_file_output(true) is used in a bunch of tests though, so it's just a lucky coincidence that all tests give file a proper test.parquet name and not just test.

Personally I think that all kinds of extension-based heuristics don't belong in such low level code like start_demuxer_task and perhaps better left at the DataFrame level.

Whichever heuristic version (pre v36, pre v43, or post v43) is the right one - I don't really mind, but I think there should be a way to skip it and specify explicitly.

@irenjj
Copy link
Contributor

irenjj commented Nov 19, 2024

It seems that the previous PR intentionally removed the single_file_output option, but later introduced a heuristic.

@alamb
Copy link
Contributor

alamb commented Jan 13, 2025

I feel like i reviewed a PR recently related to this issue but could not find it. I wonder if it is still valid

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
bug Something isn't working good first issue Good for newcomers regression Something that used to work no longer does
Projects
None yet
Development

No branches or pull requests

3 participants