Skip to content

Change mapping of SQL VARCHAR from Utf8 to Utf8View #15096

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Open
3 of 6 tasks
Tracked by #15771 ...
alamb opened this issue Mar 8, 2025 · 18 comments
Open
3 of 6 tasks
Tracked by #15771 ...

Change mapping of SQL VARCHAR from Utf8 to Utf8View #15096

alamb opened this issue Mar 8, 2025 · 18 comments
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Mar 8, 2025

Is your feature request related to a problem or challenge?

DataFusion uses Arrow types internally. Thus when planning SQL queries there is a mapping from SQL types to Arrow Types. The current mapping for character types is shown in the docs https://datafusion.apache.org/user-guide/sql/data_types.html#character-types

SQL DataType Arrow DataType
CHAR Utf8
VARCHAR Utf8
TEXT Utf8
STRING Utf8

So this means that when you do something like create table foo(x varchar); the x column is Utf8

DataFusion CLI v46.0.0
> create table foo(x varchar);

0 row(s) fetched.
Elapsed 0.019 seconds.

> describe foo;
+-------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-------------+-----------+-------------+
| x           | Utf8      | YES         |
+-------------+-----------+-------------+
1 row(s) fetched.
Elapsed 0.008 seconds.

When reading parquet files however, a different type, Utf8View is used as it is faster in most cases.

This can be seen in this example:

DataFusion CLI v46.0.0
> describe 'hits.parquet';
+-----------------------+-----------+-------------+
| column_name           | data_type | is_nullable |
+-----------------------+-----------+-------------+
| WatchID               | Int64     | NO          |
| JavaEnable            | Int16     | NO          |
| Title                 | Utf8View  | NO          |
...
+-----------------------+-----------+-------------+
105 row(s) fetched.
Elapsed 0.032 seconds.

Thus there is a discrepancy when creating external tables with a schema (VARCHAR) as that will use Utf8 rather than UTF8View

I believe this is the root cause of the issue @zhuqi-lucas filed:

Describe the solution you'd like

I think we should consider changing the default SQL mapping from VARCHAR --> Utf8View

Describe alternatives you've considered

There are a few subtasks required before we can merge it:

Additional context

You can see some of the history related to using string view / Utf8View here:

@alamb
Copy link
Contributor Author

alamb commented Mar 8, 2025

Please add comments if you find other needed items / issues

@alamb
Copy link
Contributor Author

alamb commented Mar 8, 2025

To begin this project so that we can implement it incrementally, I suggest we create a new config option like default_sql_to_view_types or something so we can test / work through issues before turning it on by default

@zhuqi-lucas
Copy link
Contributor

To begin this project so that we can implement it incrementally, I suggest we create a new config option like default_sql_to_view_types or something so we can test / work through issues before turning it on by default

Thank you @alamb , this is a great suggestion! And we finally can make it default to true when we finish all tasks!

@zhuqi-lucas
Copy link
Contributor

I will try to create more sub-tasks related to this effort!

@zhuqi-lucas
Copy link
Contributor

zhuqi-lucas commented Mar 9, 2025

  • Config: Add support default sql varchar to view types

Submitted a PR for the first step of our incremental work:
#15104

cc @alamb Thanks!

@zhuqi-lucas
Copy link
Contributor

zhuqi-lucas commented Mar 9, 2025

I also testing the tcph when it use the utf8view default, here is the result:

./benchmarks/bench.sh compare main issue_14909
Comparing main and issue_14909
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃    main ┃ issue_14909 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 172.78ms │     70.32ms │     no change │
│ QQuery 227.64ms │     26.30ms │     no change │
│ QQuery 337.75ms │     38.22ms │     no change │
│ QQuery 427.97ms │     28.27ms │     no change │
│ QQuery 552.59ms │     49.36ms │ +1.07x faster │
│ QQuery 620.81ms │     20.66ms │     no change │
│ QQuery 770.44ms │     75.06ms │  1.07x slower │
│ QQuery 848.32ms │     49.02ms │     no change │
│ QQuery 962.60ms │     63.14ms │     no change │
│ QQuery 1055.94ms │     58.75ms │  1.05x slower │
│ QQuery 1119.44ms │     21.21ms │  1.09x slower │
│ QQuery 1236.59ms │     37.42ms │     no change │
│ QQuery 1334.05ms │     34.88ms │     no change │
│ QQuery 1426.50ms │     26.77ms │     no change │
│ QQuery 1542.97ms │     45.06ms │     no change │
│ QQuery 1619.25ms │     20.02ms │     no change │
│ QQuery 1773.64ms │     68.81ms │ +1.07x faster │
│ QQuery 1896.62ms │     95.08ms │     no change │
│ QQuery 1946.77ms │     45.75ms │     no change │
│ QQuery 2045.54ms │     40.98ms │ +1.11x faster │
│ QQuery 2195.29ms │     95.19ms │     no change │
│ QQuery 2218.34ms │     17.99ms │     no change │
└──────────────┴─────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary          ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (main)1031.87ms │
│ Total Time (issue_14909)1028.26ms │
│ Average Time (main)46.90ms │
│ Average Time (issue_14909)46.74ms │
│ Queries Faster3 │
│ Queries Slower3 │
│ Queries with No Change16 │
└────────────────────────────┴───────────┘

@zhuqi-lucas
Copy link
Contributor

Create the ticket for avro:

  • Support Utf8View for avro #7262

@zhuqi-lucas
Copy link
Contributor

zhuqi-lucas commented Mar 11, 2025

New sub_task:

  • topk functionality for aggregates should support utf8view

PR: #15152

@zhuqi-lucas
Copy link
Contributor

zhuqi-lucas commented Mar 14, 2025

New sub_task:

  • Support logic optimize rule to pass the case that Utf8view datatype combined with Utf8 datatype

@zhuqi-lucas
Copy link
Contributor

zhuqi-lucas commented Mar 16, 2025

New sub_task:

  • Support Utf8View datatype for range queries

Submitted a PR:
#15257

@zhuqi-lucas
Copy link
Contributor

zhuqi-lucas commented Mar 16, 2025

New sub_task:

  • Performance improvement: Binary operators native support for Utf8View

@alamb
Copy link
Contributor Author

alamb commented Mar 17, 2025

erformance improvement: Binary operators native support for Utf8View

Yes, 100%

@zhuqi-lucas
Copy link
Contributor

erformance improvement: Binary operators native support for Utf8View

Yes, 100%

Submitted the PR for review:
#15275

@zhuqi-lucas
Copy link
Contributor

zhuqi-lucas commented Mar 21, 2025

New sub_task:

  • Perf: Support Utf8View datatype single column comparisons for SortPreservingMergeStream

@zhuqi-lucas
Copy link
Contributor

Updated:

Most of the tasks are resolved, i am trying to do more performance investigation and testing if we default change to Utf8View for all varchar.

@zhuqi-lucas
Copy link
Contributor

zhuqi-lucas commented Mar 28, 2025

Also updated the latest clickbench for the current main compare the default mapping varchar to utf8view:

Small improvement, i think becasue it's parquet format, mostly we already load it as the Utf8View for benchmark:

Using --profile release-nonlto result:

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       tmp ┃       tmp ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 00.36ms │    0.32ms │ +1.13x faster │
│ QQuery 127.55ms │   28.53ms │     no change │
│ QQuery 254.73ms │   58.73ms │  1.07x slower │
│ QQuery 349.67ms │   51.85ms │     no change │
│ QQuery 4321.47ms │  333.88ms │     no change │
│ QQuery 5374.77ms │  370.26ms │     no change │
│ QQuery 626.06ms │   27.14ms │     no change │
│ QQuery 729.89ms │   28.19ms │ +1.06x faster │
│ QQuery 8397.90ms │  372.24ms │ +1.07x faster │
│ QQuery 9583.40ms │  599.98ms │     no change │
│ QQuery 10148.50ms │  147.43ms │     no change │
│ QQuery 11163.20ms │  165.56ms │     no change │
│ QQuery 12399.31ms │  407.08ms │     no change │
│ QQuery 13568.61ms │  576.26ms │     no change │
│ QQuery 14389.05ms │  374.46ms │     no change │
│ QQuery 15375.66ms │  370.85ms │     no change │
│ QQuery 16720.94ms │  719.03ms │     no change │
│ QQuery 17662.21ms │  638.33ms │     no change │
│ QQuery 181694.34ms │ 1507.92ms │ +1.12x faster │
│ QQuery 1941.26ms │   42.08ms │     no change │
│ QQuery 20619.60ms │  549.74ms │ +1.13x faster │
│ QQuery 21779.77ms │  691.91ms │ +1.13x faster │
│ QQuery 221411.33ms │ 1375.59ms │     no change │
│ QQuery 233891.02ms │ 3946.51ms │     no change │
│ QQuery 24252.52ms │  247.12ms │     no change │
│ QQuery 25252.81ms │  248.90ms │     no change │
│ QQuery 26264.57ms │  276.89ms │     no change │
│ QQuery 27842.86ms │  854.72ms │     no change │
│ QQuery 286461.67ms │ 6410.47ms │     no change │
│ QQuery 29379.88ms │  359.71ms │ +1.06x faster │
│ QQuery 30352.77ms │  332.15ms │ +1.06x faster │
│ QQuery 31366.75ms │  371.79ms │     no change │
│ QQuery 321273.77ms │ 1427.04ms │  1.12x slower │
│ QQuery 331601.21ms │ 1599.55ms │     no change │
│ QQuery 341605.00ms │ 1701.18ms │  1.06x slower │
│ QQuery 35532.88ms │  576.30ms │  1.08x slower │
│ QQuery 36109.92ms │  115.27ms │     no change │
│ QQuery 3757.43ms │   57.22ms │     no change │
│ QQuery 3878.78ms │   80.17ms │     no change │
│ QQuery 39196.90ms │  197.13ms │     no change │
│ QQuery 4026.52ms │   25.77ms │     no change │
│ QQuery 4125.57ms │   25.51ms │     no change │
│ QQuery 4230.03ms │   28.90ms │     no change │
└──────────────┴───────────┴───────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (tmp)28442.44ms │
│ Total Time (tmp)28319.66ms │
│ Average Time (tmp)661.45ms │
│ Average Time (tmp)658.60ms │
│ Queries Faster8 │
│ Queries Slower4 │
│ Queries with No Change31 │
└────────────────────────┴────────────┘
```rust


Using run --release result:


```rust
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main                   ┃ default_enable_utf8view ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 00.34ms │                  0.33ms │     no change │
│ QQuery 144.10ms │                 43.49ms │     no change │
│ QQuery 277.24ms │                 76.79ms │     no change │
│ QQuery 384.97ms │                 77.06ms │ +1.10x faster │
│ QQuery 4523.60ms │                554.50ms │  1.06x slower │
│ QQuery 5665.56ms │                661.98ms │     no change │
│ QQuery 638.69ms │                 39.05ms │     no change │
│ QQuery 747.27ms │                 46.57ms │     no change │
│ QQuery 8702.86ms │                682.91ms │     no change │
│ QQuery 9780.11ms │                770.69ms │     no change │
│ QQuery 10194.09ms │                172.24ms │ +1.13x faster │
│ QQuery 11199.83ms │                191.13ms │     no change │
│ QQuery 12696.13ms │                688.11ms │     no change │
│ QQuery 13890.35ms │               1001.57ms │  1.12x slower │
│ QQuery 14732.92ms │                648.96ms │ +1.13x faster │
│ QQuery 15689.57ms │                633.55ms │ +1.09x faster │
│ QQuery 161415.16ms │               1468.50ms │     no change │
│ QQuery 171297.66ms │               1319.06ms │     no change │
│ QQuery 183272.62ms │               2857.06ms │ +1.15x faster │
│ QQuery 1975.13ms │                 82.66ms │  1.10x slower │
│ QQuery 20743.88ms │                705.83ms │ +1.05x faster │
│ QQuery 21929.50ms │                897.88ms │     no change │
│ QQuery 222576.76ms │               2506.14ms │     no change │
│ QQuery 234943.09ms │               4916.55ms │     no change │
│ QQuery 24392.47ms │                384.78ms │     no change │
│ QQuery 25386.58ms │                388.37ms │     no change │
│ QQuery 26423.42ms │                417.96ms │     no change │
│ QQuery 271050.88ms │                976.19ms │ +1.08x faster │
│ QQuery 288269.73ms │               8791.73ms │  1.06x slower │
│ QQuery 29439.96ms │                442.74ms │     no change │
│ QQuery 30583.71ms │                541.02ms │ +1.08x faster │
│ QQuery 31632.33ms │                629.25ms │     no change │
│ QQuery 322580.37ms │               2523.11ms │     no change │
│ QQuery 332810.58ms │               2848.06ms │     no change │
│ QQuery 343075.43ms │               3108.88ms │     no change │
│ QQuery 35856.76ms │                891.39ms │     no change │
│ QQuery 36152.80ms │                150.15ms │     no change │
│ QQuery 37117.99ms │                118.82ms │     no change │
│ QQuery 38110.05ms │                112.33ms │     no change │
│ QQuery 39267.64ms │                279.12ms │     no change │
│ QQuery 4041.34ms │                 45.52ms │  1.10x slower │
│ QQuery 4142.04ms │                 41.70ms │     no change │
│ QQuery 4250.40ms │                 45.41ms │ +1.11x faster │
└──────────────┴────────────────────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main)43905.93ms │
│ Total Time (default_enable_utf8view)43779.11ms │
│ Average Time (main)1021.07ms │
│ Average Time (default_enable_utf8view)1018.12ms │
│ Queries Faster9 │
│ Queries Slower5 │
│ Queries with No Change29 │
└────────────────────────────────────────┴────────────┘

@alamb
Copy link
Contributor Author

alamb commented Mar 28, 2025

Small improvement, i think becasue it's parquet format, mostly we already load it as the Utf8View for benchmark:

Yes I would expect no change for the clickbench benchmark as it doesn't use SQL VARCHAR 🤔

@getChan
Copy link
Contributor

getChan commented Apr 12, 2025

Update subtasks list (maybe)

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants