Change mapping of SQL VARCHAR from Utf8 to Utf8View #15096
Comments
Please add comments if you find other needed items / issues |
To begin this project so that we can implement it incrementally, I suggest we create a new config option like |
Thank you @alamb, this is a great suggestion! We can then make it default to true once all the tasks are finished. |
I will try to create more sub-tasks related to this effort! |
I also tested TPC-H with Utf8View as the default mapping. Here is the result:
./benchmarks/bench.sh compare main issue_14909
Comparing main and issue_14909
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ main ┃ issue_14909 ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1 │ 72.78ms │ 70.32ms │ no change │
│ QQuery 2 │ 27.64ms │ 26.30ms │ no change │
│ QQuery 3 │ 37.75ms │ 38.22ms │ no change │
│ QQuery 4 │ 27.97ms │ 28.27ms │ no change │
│ QQuery 5 │ 52.59ms │ 49.36ms │ +1.07x faster │
│ QQuery 6 │ 20.81ms │ 20.66ms │ no change │
│ QQuery 7 │ 70.44ms │ 75.06ms │ 1.07x slower │
│ QQuery 8 │ 48.32ms │ 49.02ms │ no change │
│ QQuery 9 │ 62.60ms │ 63.14ms │ no change │
│ QQuery 10 │ 55.94ms │ 58.75ms │ 1.05x slower │
│ QQuery 11 │ 19.44ms │ 21.21ms │ 1.09x slower │
│ QQuery 12 │ 36.59ms │ 37.42ms │ no change │
│ QQuery 13 │ 34.05ms │ 34.88ms │ no change │
│ QQuery 14 │ 26.50ms │ 26.77ms │ no change │
│ QQuery 15 │ 42.97ms │ 45.06ms │ no change │
│ QQuery 16 │ 19.25ms │ 20.02ms │ no change │
│ QQuery 17 │ 73.64ms │ 68.81ms │ +1.07x faster │
│ QQuery 18 │ 96.62ms │ 95.08ms │ no change │
│ QQuery 19 │ 46.77ms │ 45.75ms │ no change │
│ QQuery 20 │ 45.54ms │ 40.98ms │ +1.11x faster │
│ QQuery 21 │ 95.29ms │ 95.19ms │ no change │
│ QQuery 22 │ 18.34ms │ 17.99ms │ no change │
└──────────────┴─────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary ┃ ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main) │ 1031.87ms │
│ Total Time (issue_14909) │ 1028.26ms │
│ Average Time (main) │ 46.90ms │
│ Average Time (issue_14909) │ 46.74ms │
│ Queries Faster │ 3 │
│ Queries Slower │ 3 │
│ Queries with No Change │ 16 │
└────────────────────────────┴───────────┘ |
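For readers checking the Change column by hand, here is a minimal sketch of how the labels can be reproduced from the two timings. The 5% noise threshold and the comparison/baseline ratio are assumptions inferred from the rows above, not taken from the benchmark script, and `describe_change` is a hypothetical helper name.

```python
# Assumption: a change within about 5% is reported as "no change".
# This threshold is inferred from the table, not read from the script.
NOISE_THRESHOLD = 0.05


def describe_change(baseline_ms: float, comparison_ms: float) -> str:
    """Label a timing pair the way the Change column appears to."""
    ratio = comparison_ms / baseline_ms
    if 1 - NOISE_THRESHOLD <= ratio <= 1 + NOISE_THRESHOLD:
        return "no change"
    if ratio < 1:
        # The comparison run took less time than the baseline.
        return f"+{1 / ratio:.2f}x faster"
    return f"{ratio:.2f}x slower"


# (main, issue_14909) timings in ms, copied from the table above.
rows = {
    1: (72.78, 70.32),
    5: (52.59, 49.36),
    10: (55.94, 58.75),
    20: (45.54, 40.98),
}

for query, (main_ms, branch_ms) in rows.items():
    print(f"Query {query}: {describe_change(main_ms, branch_ms)}")
```

Under these assumptions, Query 10 lands just outside the threshold (58.75 / 55.94 is about 1.0502), which is why it is reported as slower while Query 15 (about 1.049) is not.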
Create the ticket for avro:
|
New sub_task:
PR: #15152 |
New sub_task:
|
New sub_task:
Submitted a PR: |
New sub_task:
|
Yes, 100% |
Submitted the PR for review: |
New sub_task:
|
Update: most of the tasks are resolved. I am now doing more performance investigation and testing of what happens if we change the default mapping to Utf8View for all VARCHAR columns. |
I also updated the latest ClickBench results, comparing current main against mapping VARCHAR to Utf8View by default. There is a small improvement; I think that is because the data is in Parquet format, so for the benchmark we mostly already load it as Utf8View.
Using --profile release-nonlto result:
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ tmp ┃ tmp ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0 │ 0.36ms │ 0.32ms │ +1.13x faster │
│ QQuery 1 │ 27.55ms │ 28.53ms │ no change │
│ QQuery 2 │ 54.73ms │ 58.73ms │ 1.07x slower │
│ QQuery 3 │ 49.67ms │ 51.85ms │ no change │
│ QQuery 4 │ 321.47ms │ 333.88ms │ no change │
│ QQuery 5 │ 374.77ms │ 370.26ms │ no change │
│ QQuery 6 │ 26.06ms │ 27.14ms │ no change │
│ QQuery 7 │ 29.89ms │ 28.19ms │ +1.06x faster │
│ QQuery 8 │ 397.90ms │ 372.24ms │ +1.07x faster │
│ QQuery 9 │ 583.40ms │ 599.98ms │ no change │
│ QQuery 10 │ 148.50ms │ 147.43ms │ no change │
│ QQuery 11 │ 163.20ms │ 165.56ms │ no change │
│ QQuery 12 │ 399.31ms │ 407.08ms │ no change │
│ QQuery 13 │ 568.61ms │ 576.26ms │ no change │
│ QQuery 14 │ 389.05ms │ 374.46ms │ no change │
│ QQuery 15 │ 375.66ms │ 370.85ms │ no change │
│ QQuery 16 │ 720.94ms │ 719.03ms │ no change │
│ QQuery 17 │ 662.21ms │ 638.33ms │ no change │
│ QQuery 18 │ 1694.34ms │ 1507.92ms │ +1.12x faster │
│ QQuery 19 │ 41.26ms │ 42.08ms │ no change │
│ QQuery 20 │ 619.60ms │ 549.74ms │ +1.13x faster │
│ QQuery 21 │ 779.77ms │ 691.91ms │ +1.13x faster │
│ QQuery 22 │ 1411.33ms │ 1375.59ms │ no change │
│ QQuery 23 │ 3891.02ms │ 3946.51ms │ no change │
│ QQuery 24 │ 252.52ms │ 247.12ms │ no change │
│ QQuery 25 │ 252.81ms │ 248.90ms │ no change │
│ QQuery 26 │ 264.57ms │ 276.89ms │ no change │
│ QQuery 27 │ 842.86ms │ 854.72ms │ no change │
│ QQuery 28 │ 6461.67ms │ 6410.47ms │ no change │
│ QQuery 29 │ 379.88ms │ 359.71ms │ +1.06x faster │
│ QQuery 30 │ 352.77ms │ 332.15ms │ +1.06x faster │
│ QQuery 31 │ 366.75ms │ 371.79ms │ no change │
│ QQuery 32 │ 1273.77ms │ 1427.04ms │ 1.12x slower │
│ QQuery 33 │ 1601.21ms │ 1599.55ms │ no change │
│ QQuery 34 │ 1605.00ms │ 1701.18ms │ 1.06x slower │
│ QQuery 35 │ 532.88ms │ 576.30ms │ 1.08x slower │
│ QQuery 36 │ 109.92ms │ 115.27ms │ no change │
│ QQuery 37 │ 57.43ms │ 57.22ms │ no change │
│ QQuery 38 │ 78.78ms │ 80.17ms │ no change │
│ QQuery 39 │ 196.90ms │ 197.13ms │ no change │
│ QQuery 40 │ 26.52ms │ 25.77ms │ no change │
│ QQuery 41 │ 25.57ms │ 25.51ms │ no change │
│ QQuery 42 │ 30.03ms │ 28.90ms │ no change │
└──────────────┴───────────┴───────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary ┃ ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (tmp) │ 28442.44ms │
│ Total Time (tmp) │ 28319.66ms │
│ Average Time (tmp) │ 661.45ms │
│ Average Time (tmp) │ 658.60ms │
│ Queries Faster │ 8 │
│ Queries Slower │ 4 │
│ Queries with No Change │ 31 │
└────────────────────────┴────────────┘
Using run --release result:
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query ┃ main ┃ default_enable_utf8view ┃ Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0 │ 0.34ms │ 0.33ms │ no change │
│ QQuery 1 │ 44.10ms │ 43.49ms │ no change │
│ QQuery 2 │ 77.24ms │ 76.79ms │ no change │
│ QQuery 3 │ 84.97ms │ 77.06ms │ +1.10x faster │
│ QQuery 4 │ 523.60ms │ 554.50ms │ 1.06x slower │
│ QQuery 5 │ 665.56ms │ 661.98ms │ no change │
│ QQuery 6 │ 38.69ms │ 39.05ms │ no change │
│ QQuery 7 │ 47.27ms │ 46.57ms │ no change │
│ QQuery 8 │ 702.86ms │ 682.91ms │ no change │
│ QQuery 9 │ 780.11ms │ 770.69ms │ no change │
│ QQuery 10 │ 194.09ms │ 172.24ms │ +1.13x faster │
│ QQuery 11 │ 199.83ms │ 191.13ms │ no change │
│ QQuery 12 │ 696.13ms │ 688.11ms │ no change │
│ QQuery 13 │ 890.35ms │ 1001.57ms │ 1.12x slower │
│ QQuery 14 │ 732.92ms │ 648.96ms │ +1.13x faster │
│ QQuery 15 │ 689.57ms │ 633.55ms │ +1.09x faster │
│ QQuery 16 │ 1415.16ms │ 1468.50ms │ no change │
│ QQuery 17 │ 1297.66ms │ 1319.06ms │ no change │
│ QQuery 18 │ 3272.62ms │ 2857.06ms │ +1.15x faster │
│ QQuery 19 │ 75.13ms │ 82.66ms │ 1.10x slower │
│ QQuery 20 │ 743.88ms │ 705.83ms │ +1.05x faster │
│ QQuery 21 │ 929.50ms │ 897.88ms │ no change │
│ QQuery 22 │ 2576.76ms │ 2506.14ms │ no change │
│ QQuery 23 │ 4943.09ms │ 4916.55ms │ no change │
│ QQuery 24 │ 392.47ms │ 384.78ms │ no change │
│ QQuery 25 │ 386.58ms │ 388.37ms │ no change │
│ QQuery 26 │ 423.42ms │ 417.96ms │ no change │
│ QQuery 27 │ 1050.88ms │ 976.19ms │ +1.08x faster │
│ QQuery 28 │ 8269.73ms │ 8791.73ms │ 1.06x slower │
│ QQuery 29 │ 439.96ms │ 442.74ms │ no change │
│ QQuery 30 │ 583.71ms │ 541.02ms │ +1.08x faster │
│ QQuery 31 │ 632.33ms │ 629.25ms │ no change │
│ QQuery 32 │ 2580.37ms │ 2523.11ms │ no change │
│ QQuery 33 │ 2810.58ms │ 2848.06ms │ no change │
│ QQuery 34 │ 3075.43ms │ 3108.88ms │ no change │
│ QQuery 35 │ 856.76ms │ 891.39ms │ no change │
│ QQuery 36 │ 152.80ms │ 150.15ms │ no change │
│ QQuery 37 │ 117.99ms │ 118.82ms │ no change │
│ QQuery 38 │ 110.05ms │ 112.33ms │ no change │
│ QQuery 39 │ 267.64ms │ 279.12ms │ no change │
│ QQuery 40 │ 41.34ms │ 45.52ms │ 1.10x slower │
│ QQuery 41 │ 42.04ms │ 41.70ms │ no change │
│ QQuery 42 │ 50.40ms │ 45.41ms │ +1.11x faster │
└──────────────┴────────────────────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary ┃ ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main) │ 43905.93ms │
│ Total Time (default_enable_utf8view) │ 43779.11ms │
│ Average Time (main) │ 1021.07ms │
│ Average Time (default_enable_utf8view) │ 1018.12ms │
│ Queries Faster │ 9 │
│ Queries Slower │ 5 │
│ Queries with No Change │ 29 │
└────────────────────────────────────────┴────────────┘ |
Yes I would expect no change for the clickbench benchmark as it doesn't use SQL |
Update subtasks list (maybe)
|
Is your feature request related to a problem or challenge?
DataFusion uses Arrow types internally. Thus when planning SQL queries there is a mapping from SQL types to Arrow Types. The current mapping for character types is shown in the docs https://datafusion.apache.org/user-guide/sql/data_types.html#character-types
| SQL type | Arrow type |
| --- | --- |
| `CHAR` | `Utf8` |
| `VARCHAR` | `Utf8` |
| `TEXT` | `Utf8` |
| `STRING` | `Utf8` |
So this means that when you do something like
create table foo(x varchar);
the `x` column is `Utf8`.
When reading parquet files, however, a different type, `Utf8View`, is used as it is faster in most cases.
This can be seen in this example:
Thus there is a discrepancy when creating external tables with a schema (`VARCHAR`), as that will use `Utf8` rather than `Utf8View`.
I believe this is the root cause of the issue @zhuqi-lucas filed: `schema_force_view_type` configuration not working for `CREATE EXTERNAL TABLE` #14909
Describe the solution you'd like
I think we should consider changing the default SQL mapping of `VARCHAR` from `Utf8` to `Utf8View`.
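To make the proposal concrete, here is a toy model of the mapping with an opt-in flag, along the lines of the incremental rollout suggested in the comments. It is only a sketch: it is not DataFusion code, and `SqlTypeOptions`, `use_view_for_varchar`, and `arrow_type_for` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class SqlTypeOptions:
    # Hypothetical flag, for illustration only; not a DataFusion option name.
    # Starting at False keeps today's behavior until the subtasks are done.
    use_view_for_varchar: bool = False


CHARACTER_TYPES = {"CHAR", "VARCHAR", "TEXT", "STRING"}


def arrow_type_for(sql_type: str, options: SqlTypeOptions) -> str:
    """Return the Arrow type name for a SQL character type (toy model)."""
    name = sql_type.strip().upper()
    if name not in CHARACTER_TYPES:
        raise ValueError(f"not a character type: {sql_type}")
    if name == "VARCHAR" and options.use_view_for_varchar:
        return "Utf8View"
    # Current mapping from the docs: every character type maps to Utf8.
    return "Utf8"


print(arrow_type_for("varchar", SqlTypeOptions()))
print(arrow_type_for("varchar", SqlTypeOptions(use_view_for_varchar=True)))
```

Gating the change behind a flag like this means the default can be flipped in a single place once the remaining work has landed.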
Describe alternatives you've considered
There are a few subtasks required before we can merge it:
`Utf8View`) #15403
Additional context
You can see some of the history related to using string view / `Utf8View` here:
`StringView` in DataFusion #11752