Regression: DataFrameWriteOptions::with_single_file_output produces a directory #13323
Comments
I agree this is a regression. Thank you for the callout, @sergiimk. I think this is a pretty good first issue for someone, as the description is clear and the need is well defined.
take
It seems hard to control the behavior of
Did some digging and found this old PR #9041 (cc @yyy1000) that seems to have removed

Looking at v42 code it does indeed seem that

let single_file_output = !base_output_path.is_collection();

which in v43 became:

let single_file_output = !base_output_path.is_collection() && base_output_path.file_extension().is_some();

The

Personally I think that all kinds of extension-based heuristics don't belong in such low level code like

Whichever heuristic version (pre v36, pre v43, or post v43) is the right one, I don't really mind, but I think there should be a way to skip it and specify the behavior explicitly.
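To make the difference between the two expressions quoted above concrete, here is a minimal standalone sketch. It does not use DataFusion: `is_collection` and `has_extension` are hypothetical stand-ins that assume a collection is a path ending in `/` and that the extension check looks at the last path segment. The real methods on `base_output_path` may behave differently in edge cases.

```rust
use std::path::Path;

// Hypothetical stand-in: assume a "collection" is a path ending in '/'.
fn is_collection(path: &str) -> bool {
    path.ends_with('/')
}

// Hypothetical stand-in for an extension check on the last path segment.
fn has_extension(path: &str) -> bool {
    Path::new(path).extension().is_some()
}

// Mirrors the shape of the v42 expression quoted in the comment above.
fn single_file_v42(path: &str) -> bool {
    !is_collection(path)
}

// Mirrors the shape of the v43 expression quoted in the comment above.
fn single_file_v43(path: &str) -> bool {
    !is_collection(path) && has_extension(path)
}
```

Under these assumptions, an extensionless path such as `out/data` is treated as a single file by the v42 form but not by the v43 form, which matches the behavior change reported in this issue. A path such as `out/data.parquet` is a single file under both.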
It seems that the previous PR intentionally removed the
I feel like I reviewed a PR recently related to this issue but could not find it. I wonder if this issue is still valid.
Describe the bug
Consider a snippet like this:
Before v43 this would write a single file called data, but in v43 this creates data as a directory with one or more randomly named files in it.
This seems to be related to #13079 (cc @dhegberg), which added an extension-based heuristic.
I see this as a regression, as single file output is requested explicitly, and I don't want a heuristic to be applied.
We are using Parquet files with a content-addressable file system and our files don't have extensions.
To Reproduce
See above
Expected behavior
Considering the introduction of the extension-based heuristic I would suggest the following behavior:
- with_single_file_output is not called (single_file_output == None): apply the heuristic
- with_single_file_output(true): produce a single file at the exact path specified
- with_single_file_output(false): create a directory under the specified path if it doesn't exist and write one or many files

Additional context
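The three cases proposed under Expected behavior could be sketched as a tri-state decision. This is only an illustration of the proposal, not DataFusion code: `OutputMode` and `resolve_output_mode` are hypothetical names, and the fallback heuristic assumes a collection is a path ending in `/` and that the extension is taken from the last path segment.

```rust
use std::path::Path;

// Hypothetical type describing how the output path is interpreted.
#[derive(Debug, PartialEq)]
enum OutputMode {
    SingleFile,
    Directory,
}

// Hypothetical helper: an explicit setting always wins, and the
// extension-based heuristic applies only when nothing was specified.
fn resolve_output_mode(single_file_output: Option<bool>, path: &str) -> OutputMode {
    match single_file_output {
        // with_single_file_output(true): exactly one file at the given path.
        Some(true) => OutputMode::SingleFile,
        // with_single_file_output(false): a directory with one or many files.
        Some(false) => OutputMode::Directory,
        // Not called: fall back to the heuristic (assumed form, see lead-in).
        None => {
            let is_collection = path.ends_with('/');
            let has_extension = Path::new(path).extension().is_some();
            if !is_collection && has_extension {
                OutputMode::SingleFile
            } else {
                OutputMode::Directory
            }
        }
    }
}
```

With this shape, an extensionless content-addressed path such as `store/data` resolves to a single file whenever `Some(true)` is passed, regardless of what the heuristic would have chosen.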