Support writing hive style partitioned files in COPY command #8493

Closed
alamb opened this issue Dec 11, 2023 · 5 comments · Fixed by #9240
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

Contributor

alamb commented Dec 11, 2023

Is your feature request related to a problem or challenge?

A user asked on ASF Slack: https://the-asf.slack.com/archives/C04RJ0C85UZ/p1702248979379239

Does the COPY command support creating parquet files that are partitioned using hive style partitioning?

The use case is creating Hive-style partitioned datasets (e.g., as described here)

DataFusion does not support this today, but you can use an external table like this https://github.com/apache/arrow-datafusion/blob/93b21bdcd3d465ed78b610b54edf1418a47fc497/datafusion/sqllogictest/test_files/insert.slt#L45-L57

Describe the solution you'd like

@devinjdangelo notes that

The COPY statement does not currently have a built-in PARTITION BY clause in its syntax, but we could support syntax like:

COPY table to 'folder/location' (format parquet, partition_by year)

which is the same syntax that duckdb supports for this.
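
For readers unfamiliar with the term, here is a rough sketch of what "Hive-style partitioning" means for the output layout: each distinct value of a partition column becomes a `column=value` directory, and the partition column is dropped from the data written inside it. This is only an illustration in plain Python, not DataFusion's implementation. The function name `hive_partition_dirs` and the sample rows are made up for this example, and `folder/location` and `year` are taken from the proposed statement above.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def hive_partition_dirs(rows, partition_cols, base="folder/location"):
    """Group rows by partition values and map each group to a Hive-style directory.

    Hypothetical helper for illustration only; it does not write any files.
    """
    groups = defaultdict(list)
    for row in rows:
        # One "column=value" path segment per partition column, in the given order.
        segments = [f"{col}={row[col]}" for col in partition_cols]
        directory = str(PurePosixPath(base, *segments))
        # The partition values live in the path, so they are dropped from the row data.
        data = {k: v for k, v in row.items() if k not in partition_cols}
        groups[directory].append(data)
    return dict(groups)


# Made-up sample data with a "year" column to partition on.
rows = [
    {"year": 2022, "value": 1},
    {"year": 2023, "value": 2},
    {"year": 2023, "value": 3},
]
layout = hive_partition_dirs(rows, ["year"])
for directory, contents in sorted(layout.items()):
    print(directory, contents)
```

With the sample rows above, the two directories would be `folder/location/year=2022` and `folder/location/year=2023`. A real writer would place one or more parquet files inside each directory.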

Describe alternatives you've considered

No response

Additional context

No response

alamb added the enhancement label Dec 11, 2023
Contributor Author

alamb commented Dec 11, 2023

I think this is a relatively good project for intermediate contributors. It could be done in a few PRs, as all the underlying code exists and we already have an example of writing to partitioned datasets. Implementing this would be a matter of hooking up the APIs correctly.

There are also already good examples of copy tests in https://github.com/apache/arrow-datafusion/blob/93b21bdcd3d465ed78b610b54edf1418a47fc497/datafusion/sqllogictest/test_files/copy.slt that can be extended

alamb added the good first issue label Dec 11, 2023
Contributor

Veeupup commented Dec 11, 2023

hi @alamb Can I try this issue? It seems very interesting!

Contributor Author

alamb commented Dec 12, 2023

> hi @alamb Can I try this issue? It seems very interesting!

Can't wait to see what you come up with @Veeupup 🚀

@JacobOgle
Contributor

@Veeupup are you still working on this?

Contributor

Veeupup commented Jan 4, 2024

@JacobOgle Hi, sorry about that! I have been a little busy these days, and I'll start on it soon ~
