Skip to content

Improve default output column names #6758

New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Open
alamb opened this issue Jun 24, 2023 · 0 comments
Open

Improve default output column names #6758

alamb opened this issue Jun 24, 2023 · 0 comments
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Jun 24, 2023

Is your feature request related to a problem or challenge?

The way that DataFusion names the output columns can be quite messy. For example, the query below produces a column named "FIRST_VALUE(foo.free) PARTITION BY [foo.host] ORDER BY [foo.time DESC NULLS FIRST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW + Int64(5) / Int64(12)" (yes that whole thing)

Arrow requires column names to be distinct strings and DataFusion creates the name based on the content of the expression.

Here is the entire query

❯ create table foo(host varchar, free int, time int) as values ('a', 1, 2), ('a', 3, 4);
0 rows in set. Query took 0.003 seconds.
❯ select first_value(free) over (partition by host order by time desc) + 5 / 12 from foo limit 10;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
| FIRST_VALUE(foo.free) PARTITION BY [foo.host] ORDER BY [foo.time DESC NULLS FIRST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW + Int64(5) / Int64(12) |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 3                                                                                                                                                           |
| 3                                                                                                                                                           |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set. Query took 0.003 seconds.

It also means you can't select the same expression without an alias, which while annoying is likely not a critical usecase

select free / 1, free / 1 from foo;
Error during planning: Projections require unique expression names but the expression "foo.free / Int64(1)" at position 0 and "foo.free / Int64(1)" at position 1 have the same name. Consider aliasing ("AS") one of them.
❯ select free / 1, free / 1 as col2 from foo;
+---------------------+------+
| foo.free / Int64(1) | col2 |
+---------------------+------+
| 1                   | 1    |
| 3                   | 3    |
+---------------------+------+

Describe the solution you'd like

While the output column names can be controlled via adding an explicit alias, I think it would be good to have a more concise way of naming the default outputs, especially for window functions

Perhaps something like not adding the OVER clause's contents so FIRST_VALUE(foo.free) in the above example.

If we do this we need to be careful to allow multiple calls to FIRST_VALUE with different OVER clauses in the same query

Describe alternatives you've considered

No response

Additional context

If we make the names more concise, we will probably need a solution like #6543 to ensure they remain unique

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant