Skip to content

Increase ByteViewMap block size to 2MB #11674

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Merged
merged 2 commits into from
Jul 27, 2024

Conversation

XiangpengHao
Copy link
Contributor

@XiangpengHao XiangpengHao commented Jul 26, 2024

Targets the string-view2 branch

Which issue does this PR close?

Closes #.

Rationale for this change

Increase the default block size from 8KB to 2MB, this significantly improves the aggregation performance with high cardinality.
More details see apache/arrow-rs#6094

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@XiangpengHao
Copy link
Contributor Author

This the last piece to get the initial StringView support, testing on my machine shows that it can increase the string-intensive ClickBench performance by 20%-200%.

@@ -149,7 +149,7 @@ where
output_type,
map: hashbrown::raw::RawTable::with_capacity(INITIAL_MAP_CAPACITY),
map_size: 0,
builder: GenericByteViewBuilder::new(),
builder: GenericByteViewBuilder::new().with_block_size(2 * 1024 * 1024),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After apache/arrow-rs#6136 is merged, will this be needed anymore?

@alamb alamb merged commit 5690712 into apache:string-view2 Jul 27, 2024
25 checks passed
alamb added a commit that referenced this pull request Jul 29, 2024
… some ClickBench queries (not on by default) (#11667)

* Pin to pre-release version of arrow 52.2.0

* Update for deprecated method

* Add a config to force using string view in benchmark (#11514)

* add a knob to force string view in benchmark

* fix sql logic test

* update doc

* fix ci

* fix ci only test

* Update benchmarks/src/util/options.rs

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Update datafusion/common/src/config.rs

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* update tests

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Add String view helper functions (#11517)

* add functions

* add tests for hash util

* Add ArrowBytesViewMap and ArrowBytesViewSet (#11515)

* Update `string-view` branch to arrow-rs main (#10966)

* Pin to arrow main

* Fix clippy with latest arrow

* Uncomment test that needs new arrow-rs to work

* Update datafusion-cli Cargo.lock

* Update Cargo.lock

* tapelo

* merge

* update cast

* consistent dep

* fix ci

* add more tests

* make doc happy

* update new implementation

* fix bug

* avoid unused dep

* update dep

* update

* fix cargo check

* update doc

* pick up the comments change again

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Enable `GroupValueBytesView` for aggregation with StringView types (#11519)

* add functions

* Update `string-view` branch to arrow-rs main (#10966)

* Pin to arrow main

* Fix clippy with latest arrow

* Uncomment test that needs new arrow-rs to work

* Update datafusion-cli Cargo.lock

* Update Cargo.lock

* tapelo

* merge

* update cast

* consistent dep

* fix ci

* avoid unused dep

* update dep

* update

* fix cargo check

* better group value view aggregation

* update

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Initial support for regex_replace on `StringViewArray` (#11556)

* initial support for string view regex

* update tests

* Add support for Utf8View for date/temporal codepaths (#11518)

* Add StringView support for date_part and make_date funcs

* run cargo update in datafusion-cli

* cargo fmt

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* GC `StringViewArray` in `CoalesceBatchesStream` (#11587)

* gc string view when appropriate

* make clippy happy

* address comments

* make doc happy

* update style

* Add comments and tests for gc_string_view_batch

* better herustic

* update test

* Update datafusion/physical-plan/src/coalesce_batches.rs

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* [Bug] fix bug in return type inference of `utf8_to_int_type` (#11662)

* fix bug in return type inference

* update doc

* add tests

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Fix clippy

* Increase ByteViewMap block size to 2MB (#11674)

* better default block size

* fix related test

* Change `--string-view` to only apply to parquet formats (#11663)

* use inferenced schema, don't load schema again

* move config to parquet-only

* update

* update

* better format

* format

* update

* Implement native support StringView for character length (#11676)

* native support for character length

* Update datafusion/functions/src/unicode/character_length.rs

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Remove uneeded patches

* cargo fmt

---------

Co-authored-by: Xiangpeng Hao <haoxiangpeng123@gmail.com>
Co-authored-by: Xiangpeng Hao <me@haoxp.xyz>
Co-authored-by: Andrew Duffy <a10y@users.noreply.github.com>
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants