Skip to content

Speed up hash partitioning #6822

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Open
Dandandan opened this issue Jul 2, 2023 · 5 comments
Open

Speed up hash partitioning #6822

Dandandan opened this issue Jul 2, 2023 · 5 comments
Labels
enhancement New feature or request performance Make DataFusion faster

Comments

@Dandandan
Copy link
Contributor

Dandandan commented Jul 2, 2023

Is your feature request related to a problem or challenge?

Also see request in arrow apache/arrow-rs#4476

In DataFusion, a common operation is to repartition a RecordBatch by hashing one or more columns and dividing them into partition record batches using the "formula" hash % num_partitions.

The current approach is to create the indices that match and use them to take the individual arrays (see BatchPartitioner in datafusion).

This is relatively expensive however, as we visit the arrays num_partitions times in different places of the array, leading to cache inefficient operators (especially when the number of partitions is high).

Describe the solution you'd like

Faster hash-partitioning implementation

Describe alternatives you've considered

No response

Additional context

No response

@Dandandan Dandandan added the enhancement New feature or request label Jul 2, 2023
@Dandandan Dandandan changed the title Speed up partitioning operator Speed up hash partitioning operator Jul 2, 2023
@Dandandan Dandandan changed the title Speed up hash partitioning operator Speed up hash partitioning Jul 2, 2023
@alamb
Copy link
Contributor

alamb commented Jul 3, 2023

I recommend we look into implementing Selection Vectors / bitmaks -- then repartitioning could become a calculation of such filters/ bitmasks

@zebsme
Copy link
Contributor

zebsme commented Mar 22, 2025

hi @Dandandan @alamb, I tried some experiments by running tpch benchmarks.
And would like to share my findings for others who might be interested in this:

  1. Bitmask/filter is a bit slower than current implementation.
  2. Flattening the nested Vec can improve performance for some queries. However, for some other queries, it can actually slow things down, possibly due to increased memory or less efficient access patterns.
  3. Prefix sum requires random access,which leads to bad performance.

@alamb
Copy link
Contributor

alamb commented Mar 23, 2025

Thanks for checking this out @zebsme

I don't really have any other ideas

@Dandandan
Copy link
Contributor Author

I wrote some ideas of supporting selection vectors inside hash join and aggregate (I believe we didn't have those issues?)

This seems to be likely to give more substantial gains than trying to optimize only the partitioning code only as, even optimized, we still need to copy the inputs (and run CoalesceBatchesExec).

#15382
#15383

I think (at least for join, aggregate I am less certain) it might not be too hard to implement.

@alamb
Copy link
Contributor

alamb commented Mar 24, 2025

I believe the plans also effectively hash the group keys three times for aggregate plans:

  1. initial hash to find the group in the initial aggregate phase
  2. hash to compute the output partition
  3. hash (again) to find the group in the final aggregation phase

Passing along the pre-computed hashes, especially for strings, might be significiantly faster

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
enhancement New feature or request performance Make DataFusion faster
Projects
None yet
Development

No branches or pull requests

3 participants