
Slow comparisons to dictionary columns with type coercion #10220


Closed
alamb opened this issue Apr 24, 2024 · 3 comments · Fixed by #10323
Labels: enhancement (New feature or request)

Comments

@alamb
Contributor

alamb commented Apr 24, 2024

Is your feature request related to a problem or challenge?

In InfluxDB we use Dictionary(Int32, Utf8) columns a lot.

Queries like this (with string constants) work great and are very fast:

SELECT ... WHERE column = '1'

Queries like this (note that 1 is an integer, not the string '1') run very slowly:

SELECT ... WHERE column = 1

@erratic-pattern and I tracked this down to an issue/limitation in type coercion:

Reproducer

DataFusion CLI v37.1.0
> create table test as values (arrow_cast('1', 'Dictionary(Int32, Utf8)'));
0 row(s) fetched.
Elapsed 0.010 seconds.

> select arrow_typeof(column1) from test;
+----------------------------+
| arrow_typeof(test.column1) |
+----------------------------+
| Dictionary(Int32, Utf8)    |
+----------------------------+
1 row(s) fetched.
Elapsed 0.002 seconds.

> explain SELECT * from test where column1 = 1;
+---------------+---------------------------------------------------+
| plan_type     | plan                                              |
+---------------+---------------------------------------------------+
| logical_plan  | Filter: CAST(test.column1 AS Utf8) = Utf8("1")    |
|               |   TableScan: test projection=[column1]            |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192       |
|               |   FilterExec: CAST(column1@0 AS Utf8) = 1         |
|               |     MemoryExec: partitions=1, partition_sizes=[1] |
|               |                                                   |
+---------------+---------------------------------------------------+
2 row(s) fetched.
Elapsed 0.003 seconds.

I think this shows the core problem:

| logical_plan  | Filter: CAST(test.column1 AS Utf8) = Utf8("1")    |

It basically shows the column is being converted to a string, rather than the constant being converted to the correct type.
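
To make the cost difference concrete, here is a minimal Python sketch. It is not DataFusion code: `DictColumn`, `eq_via_cast`, and `eq_on_dictionary` are hypothetical names used only for illustration, and nulls are ignored. Casting the column decodes one string per row, whereas converting the constant lets the comparison touch each distinct dictionary value only once.

```python
from dataclasses import dataclass


@dataclass
class DictColumn:
    """Toy stand-in for a Dictionary(Int32, Utf8) array: keys index into values."""
    keys: list[int]
    values: list[str]


def eq_via_cast(col: DictColumn, literal: str) -> list[bool]:
    # Mirrors CAST(column AS Utf8) = '1': decode every row to a string first.
    decoded = [col.values[k] for k in col.keys]
    return [s == literal for s in decoded]


def eq_on_dictionary(col: DictColumn, literal: str) -> list[bool]:
    # Mirrors column = Dictionary('1'): compare each distinct value once,
    # then map the per-value result through the integer keys.
    value_matches = [v == literal for v in col.values]
    return [value_matches[k] for k in col.keys]


col = DictColumn(keys=[0, 1, 0, 2, 0], values=["1", "2", "3"])
# The integer constant 1 is converted once to the string "1".
literal = str(1)
# Both strategies agree on the result; only the amount of work differs.
assert eq_via_cast(col, literal) == eq_on_dictionary(col, literal)
print(eq_on_dictionary(col, literal))
```

With a small dictionary and many rows, the second form does a handful of string comparisons plus cheap integer lookups, which is why keeping the column encoded matters.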

Not only does this mean the column is being un-encoded for the comparison, it also means that PruningPredicate doesn't work either.
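
The pruning consequence can be sketched the same way. This is a simplified model, not the actual PruningPredicate implementation: `RowGroupStats`, `may_contain`, and `groups_to_scan` are hypothetical, and the sketch assumes a pruner that only recognizes predicates of the form `column = literal` on a bare column.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical per-row-group min/max statistics for one column."""
    min_value: str
    max_value: str


def may_contain(stats: RowGroupStats, literal: str) -> bool:
    # `column = literal` can only match if literal lies within [min, max].
    return stats.min_value <= literal <= stats.max_value


def groups_to_scan(predicate_on_bare_column: bool,
                   stats: list[RowGroupStats],
                   literal: str) -> list[int]:
    if not predicate_on_bare_column:
        # Simplifying assumption: a wrapped column (e.g. CAST(column AS Utf8))
        # is not recognized by the pruner, so nothing can be skipped.
        return list(range(len(stats)))
    return [i for i, s in enumerate(stats) if may_contain(s, literal)]


stats = [RowGroupStats("0", "0"), RowGroupStats("1", "3"), RowGroupStats("5", "9")]
print(groups_to_scan(False, stats, "1"))  # cast on the column: scan everything
print(groups_to_scan(True, stats, "1"))   # bare column: skip non-matching groups
```

Under these assumptions the cast form scans all three row groups, while the bare-column form scans only the one whose min/max range covers the constant.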

Describe the solution you'd like

I would like the query to go fast lol

Specifically, I think the filter should look like this (no cast on the column; instead, the constant's type matches the column's):

| logical_plan  | Filter: test.column1 = Dictionary(Int32, Utf8("1")) |

Note this is what happens if you compare the dictionary column to a string literal:

> explain SELECT * from test where column1 = '1';
+---------------+-----------------------------------------------------+
| plan_type     | plan                                                |
+---------------+-----------------------------------------------------+
| logical_plan  | Filter: test.column1 = Dictionary(Int32, Utf8("1")) |
|               |   TableScan: test projection=[column1]              |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192         |
|               |   FilterExec: column1@0 = 1                         |
|               |     MemoryExec: partitions=1, partition_sizes=[1]   |
|               |                                                     |
+---------------+-----------------------------------------------------+
2 row(s) fetched.
Elapsed 0.002 seconds.

>

Describe alternatives you've considered

We could potentially update the coercion logic to coerce 1 to Dictionary(.. "1"), or maybe update the unwrap_comparison logic.
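
As a rough illustration of the second alternative, the sketch below rewrites a toy expression tree so the cast moves from the column to the constant. It is written in Python with made-up node types (`Column`, `Literal`, `Cast`, `Eq`) and a hypothetical function `unwrap_dictionary_cast`; it only handles the single pattern from the reproducer and is not how DataFusion's optimizer is structured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    data_type: str  # e.g. "Dictionary(Int32, Utf8)"


@dataclass(frozen=True)
class Literal:
    value: str
    data_type: str


@dataclass(frozen=True)
class Cast:
    expr: object
    to_type: str


@dataclass(frozen=True)
class Eq:
    left: object
    right: object


DICT_UTF8 = "Dictionary(Int32, Utf8)"


def unwrap_dictionary_cast(expr: Eq) -> Eq:
    """Rewrite CAST(dict_col AS Utf8) = Utf8(lit) to dict_col = Dictionary(lit)."""
    left, right = expr.left, expr.right
    if (
        isinstance(left, Cast)
        and left.to_type == "Utf8"
        and isinstance(left.expr, Column)
        and left.expr.data_type == DICT_UTF8
        and isinstance(right, Literal)
        and right.data_type == "Utf8"
    ):
        # Move the type change onto the constant; the column stays encoded.
        return Eq(left.expr, Literal(right.value, DICT_UTF8))
    return expr  # anything else is left untouched


before = Eq(Cast(Column("column1", DICT_UTF8), "Utf8"), Literal("1", "Utf8"))
after = unwrap_dictionary_cast(before)
print(after)
```

The rewrite is only safe here because casting a Utf8 dictionary to Utf8 does not change the values being compared; a real rule would need to check that property for each type pair it handles.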

Additional context

No response

@erratic-pattern
Contributor

erratic-pattern commented Apr 24, 2024

I have a PR that fixes this: #10221. Here is the explain output after making the change:

> explain SELECT * from test where column1 = 1;
+---------------+-----------------------------------------------------+
| plan_type     | plan                                                |
+---------------+-----------------------------------------------------+
| logical_plan  | Filter: test.column1 = Dictionary(Int32, Utf8("1")) |
|               |   TableScan: test projection=[column1]              |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192         |
|               |   FilterExec: column1@0 = 1                         |
|               |     MemoryExec: partitions=1, partition_sizes=[1]   |
|               |                                                     |
+---------------+-----------------------------------------------------+
2 row(s) fetched.
Elapsed 0.008 seconds.

However, it looks like some tests are failing, so I am still looking into it.

@erratic-pattern
Contributor

#10323 is ready for review and avoids the previously discussed issues with #10221.

@alamb
Contributor Author

alamb commented Apr 30, 2024

Thanks @erratic-pattern -- I hope to look at this tomorrow morning
