Improve default output column names #6758

alamb · 2023-06-24T12:24:13Z

Is your feature request related to a problem or challenge?

DataFusion's default output column names can be quite messy. For example, the query below produces a column named "FIRST_VALUE(foo.free) PARTITION BY [foo.host] ORDER BY [foo.time DESC NULLS FIRST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW + Int64(5) / Int64(12)" (yes, that whole thing).

Arrow requires column names to be distinct strings, and DataFusion creates each name from the content of the expression.

Here is the entire query:

❯ create table foo(host varchar, free int, time int) as values ('a', 1, 2), ('a', 3, 4);
0 rows in set. Query took 0.003 seconds.
❯ select first_value(free) over (partition by host order by time desc) + 5 / 12 from foo limit 10;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
| FIRST_VALUE(foo.free) PARTITION BY [foo.host] ORDER BY [foo.time DESC NULLS FIRST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW + Int64(5) / Int64(12) |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 3                                                                                                                                                           |
| 3                                                                                                                                                           |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.003 seconds.

It also means you can't select the same expression twice without an alias, which, while annoying, is likely not a critical use case:

❯ select free / 1, free / 1 from foo;
Error during planning: Projections require unique expression names but the expression "foo.free / Int64(1)" at position 0 and "foo.free / Int64(1)" at position 1 have the same name. Consider aliasing ("AS") one of them.
❯ select free / 1, free / 1 as col2 from foo;
+---------------------+------+
| foo.free / Int64(1) | col2 |
+---------------------+------+
| 1                   | 1    |
| 3                   | 3    |
+---------------------+------+

Describe the solution you'd like

While the output column names can be controlled by adding an explicit alias, I think it would be good to have a more concise way of naming the default outputs, especially for window functions.

Perhaps something like omitting the OVER clause's contents, which would give FIRST_VALUE(foo.free) in the above example.

If we do this, we need to be careful to still allow multiple calls to FIRST_VALUE with different OVER clauses in the same query.
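To make the uniqueness concern concrete, here is a minimal sketch of one way shortened names could be kept distinct. It is written in Python purely for illustration and is not DataFusion code: the function name `unique_column_names` and the `:N` suffix format are assumptions made up for this example, not an existing or proposed API.

```python
def unique_column_names(base_names):
    """Return column names in input order, suffixing repeats to keep them distinct.

    Hypothetical helper: the ":N" suffix scheme is an assumption for this sketch.
    """
    used = set()       # every name handed out so far
    next_suffix = {}   # per base name, the next suffix number to try
    result = []
    for base in base_names:
        candidate = base
        n = next_suffix.get(base, 1)
        # Keep trying suffixes until the candidate collides with nothing.
        while candidate in used:
            candidate = f"{base}:{n}"
            n += 1
        next_suffix[base] = n
        used.add(candidate)
        result.append(candidate)
    return result


# Two FIRST_VALUE calls whose OVER clauses differ but whose short names match.
print(unique_column_names(["FIRST_VALUE(foo.free)", "FIRST_VALUE(foo.free)", "foo.host"]))
# → ['FIRST_VALUE(foo.free)', 'FIRST_VALUE(foo.free):1', 'foo.host']
```

The first occurrence keeps the short name and only later collisions get a suffix, so queries with a single window function would see the concise name unchanged.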

Describe alternatives you've considered

No response

Additional context

If we make the names more concise, we will probably need a solution like #6543 to ensure they remain unique.

alamb added the "enhancement" (New feature or request) label on Jun 24, 2023

This was referenced Jun 24, 2023

Mismatch in MemTable of Select Into when projecting on aggregate window functions #6566

Merged

Support columns having the same alias #6543

Open