-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Enable reading StringViewArray
by default from Parquet (8% improvement for entire ClickBench suite)
#13101
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
Enable reading StringViewArray
by default from Parquet (8% improvement for entire ClickBench suite)
#13101
Conversation
cbdc592
to
c0ff96f
Compare
c0ff96f
to
c95b870
Compare
StringViewArray
by default from ParquetStringViewArray
by default from Parquet (8% improvement for entire ClickBench suite)
I think it's interesting to run some more Parquet benchmarks as well to detect any regression. It looks like query 18 of TPC-H is still a tiny bit slower maybe (ran it a few times in a row). The rest is as fast or faster:
|
Btw - I don't think this should hold off the merge / release, but would be good to track/note any regressions, however small. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably makes sense to run some more benchmarks just to be sure
/// If true, will use StringView/BinaryViewArray instead of String/BinaryArray | ||
/// when reading ParquetFiles | ||
#[structopt(long)] | ||
pub force_view_types: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we keep this (or a differently-named) flag as a kill-switch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 There is a kill switch (in the description of this PR)
set datafusion.execution.parquet.schema_force_view_types = false;
0 row(s) fetched.
Elapsed 0.000 seconds.
This particular code is for the benchmark drivers and I don't think it is super valuable to retain the benchmark in both configurations
I will do so |
My plan for this PR is to hedge against disruptions by making a stable DataFusion 42.2.0 and then merging this PR into the main for inclusion in #13065 I will review the benchmark results again and look at what is going on with TPCH Q18 |
I filed this one It may be an instance that Could help with |
This PR / project has been outstanding long enough and I desparately need to close off concurrent projects. Let's merge it in and keep iterating on main |
Thanks again @findepi @Dandandan (and @Rachelint and @goldmedal and @XiangpengHao and @jayzhan211 and so many others) |
…improvement for entire ClickBench suite) (apache#13101)" This reverts commit 2d7892b.
add `schema_force_view_types=true` in query session config as it improves query performance as explained here - apache/datafusion#13101
add `schema_force_view_types=true` in query session config as it improves query performance as explained here - apache/datafusion#13101
Replacement for #12092 which had too much history on it
Which issue does this PR close?
Closes #11682
Rationale for this change
Reading data as
StringViewArray
is significantly faster thanStringArray
. We have been testing this behind a feature flag but it is now stable enough to enable by default.See blog post #11603:
Benchmark Results
(note I believe the changes for Q1 and Q2 are noise (there is no corresponding changes for the
clickbench_partition
ed table)Details for `clickbench`
What changes are included in this PR?
schema_force_view_types
to trueAre these changes tested?
Yes, by CI tests
Are there any user-facing changes?
If you see an error related to StringView use, you can disable this feature using the schema_force_string_view option