Write a blog post fast Vectorized grouping for high cardinality #6988

Closed
Tracked by #6889
alamb opened this issue Jul 16, 2023 · 4 comments · Fixed by apache/arrow-site#386

alamb commented Jul 16, 2023

The idea here is to write a blog post explaining and motivating the improvement to DataFusion grouping made in #6904.

@alamb alamb changed the title from "Write a blog post about it" to "Write a blog post fast Vectorized grouping for high cardinality" Jul 16, 2023
@alamb alamb self-assigned this Jul 16, 2023
@alamb alamb added the devrel label Jul 24, 2023

alamb commented Jul 24, 2023

I have drafted a blog post about this with @tustvold and @Dandandan. It will be published on the InfluxData blog first, and then I will propose reposting it on the Arrow blog site. I expect to have a draft up later this week.

alamb commented Aug 2, 2023

Here is the blog post we wrote about how to do high cardinality grouping really fast: https://www.influxdata.com/blog/aggregating-millions-groups-fast-apache-arrow-datafusion/

I will propose a PR to cross-post the content to the arrow blog as well in the coming days

alamb commented Aug 5, 2023

PR on arrow-site ready: apache/arrow-site#386

alamb added a commit to apache/arrow-site that referenced this issue Aug 14, 2023
…ion 28.0.0 (#386)

Closes apache/datafusion#6988

**Note**: This describes work @tustvold @Dandandan and I did in
DataFusion 28.0.0. This content was originally published on the
[InfluxData
Blog](https://www.influxdata.com/blog/aggregating-millions-groups-fast-apache-arrow-datafusion/)
but since it is generally applicable to Apache Arrow DataFusion I would
like to syndicate it here because:
1. This is a forum where the community can comment / keep it up to date
via PR
2. It is hosted on a platform with a different lifetime than a company
blog

This is the same model we followed with
https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
which was also republished on the arrow blog after the InfluxData blog

It also gives me an example to use my original ASCII art diagrams :)

alamb commented Aug 14, 2023

It is now republished at https://arrow.apache.org/blog/2023/08/05/datafusion_fast_grouping/
