`pl.lit` doesn't work properly as a join key #9603

mcrumiller · 2023-06-28T20:33:01Z

Update 2024-12-27: This still fails, but with a different error:

df1.join(df2, left_on=['a', pl.lit('b')], right_on=['a', 'b'], how="left")
# pyo3_runtime.PanicException: assertion `left == right` failed
#   left: 1
#  right: 4

Update: As per this comment the issue has changed but the title is still relevant. Here is new behavior:

import polars as pl

df1 = pl.DataFrame({
    'a': ['1', '2', '3', '4'],
})

df2 = pl.DataFrame({
    'a': ['1', '2', '3', '4'],
    'b': ['a', 'a', 'b', 'b'],
})

df1.join(df2, left_on=['a', pl.lit('b')], right_on=['a', 'b'], how="left")
# shape: (4, 3)
# ┌─────┬─────────┬──────┐
# │ a   ┆ a_right ┆ b    │
# │ --- ┆ ---     ┆ ---  │
# │ str ┆ str     ┆ str  │
# ╞═════╪═════════╪══════╡
# │ 1   ┆ null    ┆ null │
# │ 2   ┆ null    ┆ null │
# │ 3   ┆ null    ┆ null │  <-- should successfully join here
# │ 4   ┆ null    ┆ null │  <-- should successfully join here
# └─────┴─────────┴──────┘

Issue description

A pl.lit value apparently matches everything, regardless of value.

Reproducible example

import polars as pl

df1 = pl.DataFrame({
    'a': ['1', '2', '3', '4'],
})

df2 = pl.DataFrame({
    'a': ['1', '2', '3', '4'],
    'b': ['a', 'a', 'b', 'b'],
})

df1.join(
    df2,
    left_on=['a', pl.lit('b')],
    right_on=['a', 'b'],
    how="left",
)

shape: (4, 2)
┌─────┬─────────┐
│ a   ┆ literal │
│ --- ┆ ---     │
│ str ┆ str     │
╞═════╪═════════╡
│ 1   ┆ b       │
│ 2   ┆ b       │
│ 3   ┆ b       │
│ 4   ┆ b       │
└─────┴─────────┘

Expected behavior

First two records should be null.

Installed versions

--------Version info---------
Polars:      0.18.1
Index type:  UInt32
Platform:    Windows-10-10.0.19045-SP0
Python:      3.11.2 (tags/v3.11.2:878ead1, Feb  7 2023, 16:38:35) [MSC v.1934 64 bit (AMD64)]

----Optional dependencies----
numpy:       1.24.3
pandas:      2.0.0
pyarrow:     11.0.0
connectorx:  0.3.2a3
deltalake:   <not installed>
fsspec:      <not installed>
matplotlib:  3.7.1
xlsx2csv:    0.8.1
xlsxwriter:  3.1.0

The text was updated successfully, but these errors were encountered:

edavisau · 2024-01-26T06:38:55Z

I had a look into this, and I noticed that the description is slightly inaccurate. In this example, the pl.lit is matching nowhere.

The problem is that the literal doesn't broadcast when doing the join. Here is an illustrative example where only the first row is matched because the literal (effectively) expands to ["a", None, None, None] in the join implementation

import polars as pl

df1 = pl.DataFrame({
    'a': ['1', '2', '3', '4'],
})

df2 = pl.DataFrame({
    'a': ['1', '2', '3', '4'],
    'b': ['a', 'a', 'b', 'b'],
    'extra_col': [101, 102, 103, 104]
})


df1.join(
    df2,
    left_on=['a', pl.lit('a')],
    right_on=['a', 'b'],
    how="left",
)

shape: (4, 3)
┌─────┬─────────┬───────────┐
│ a   ┆ literal ┆ extra_col │
│ --- ┆ ---     ┆ ---       │
│ str ┆ str     ┆ i64       │
╞═════╪═════════╪═══════════╡
│ 1   ┆ a       ┆ 101       │
│ 2   ┆ a       ┆ null      │
│ 3   ┆ a       ┆ null      │
│ 4   ┆ a       ┆ null      │
└─────┴─────────┴───────────┘

mcrumiller · 2024-01-26T14:45:01Z

@edavisau I do see the non-broadcasting issue here, good find. Not sure if #9621 can be simultaneously resolved but I do not think the literal column should be in the result set.

edavisau · 2024-01-26T21:32:43Z

@mcrumiller I noticed this as well, to me it's a fundamental flaw with the current implementation of joins. For left joins, for example, polars effectively does

left_df.with_columns(left_on columns)
right_df.drop(right_on columns)

Which doesn't work well with "calculated columns"

In my mentioned PR above, the new behaviour would be

import polars as pl

df1 = pl.DataFrame({
    'a': ['1', '2', '3', '4'],
})

df2 = pl.DataFrame({
    'a': ['1', '2', '3', '4'],
    'b': ['a', 'a', 'b', 'b'],
    'extra_col': [101, 102, 103, 104]
})


df1.join(
    df2,
    left_on=['a', pl.lit('a')],
    right_on=['a', 'b'],
    how="left",
)

shape: (4, 3)
┌─────┬──────┬───────────┐
│ a   ┆ b    ┆ extra_col │
│ --- ┆ ---  ┆ ---       │
│ str ┆ str  ┆ i64       │
╞═════╪══════╪═══════════╡
│ 1   ┆ a    ┆ 101       │
│ 2   ┆ a    ┆ 102       │
│ 3   ┆ null ┆ null      │
│ 4   ┆ null ┆ null      │
└─────┴──────┴───────────┘

However, it's a big breaking change, and IMO should be decided simultaneously with issues like #13441 which is scheduled for 1.0 release.

mcrumiller · 2024-01-26T21:39:00Z

It's similar to not coalescing, but it's not the same. If the join condition is itself a calculation, it shouldn't be included in the output. In SQL, for example, you can do:

SELECT
  A.*, B.*
FROM A
LEFT JOIN B ON
  A.value < B.value

After doing this your A.* doesn't include a column of True/False values indicating whether A.value < B.value. I don't believe this is intended in polars either, further evidenced by the fact that you cannot alias the join expression.

edavisau · 2024-01-26T21:51:05Z

That makes sense, yep my updated condition was to check that the left_on and right_on are both not calculated expressions. To do this I compared the names and the pointer to the underlying data - here.

My point was that it all should be decided at once what the new "join behaviour" should be in 1.0. I know they are not the same issue but they are quite interdependent in my opinion.

mcrumiller · 2024-07-07T17:18:02Z

Looks like this is now wrong but for a different reason, as of 1.0.1:

df1.join(df2, left_on=['a', pl.lit('b')], right_on=['a', 'b'], how="left")
# shape: (4, 3)
# ┌─────┬─────────┬──────┐
# │ a   ┆ a_right ┆ b    │
# │ --- ┆ ---     ┆ ---  │
# │ str ┆ str     ┆ str  │
# ╞═════╪═════════╪══════╡
# │ 1   ┆ null    ┆ null │
# │ 2   ┆ null    ┆ null │
# │ 3   ┆ null    ┆ null │  <-- should successfully join here
# │ 4   ┆ null    ┆ null │  <-- should successfully join here
# └─────┴─────────┴──────┘

@ritchie46 unsure if should create new issue.

AlexeyDmitriev · 2024-08-28T15:43:46Z

I have an example that is somewhat simpler in my opinion, which also returns strange result for original issue

L = pl.DataFrame({'a': [1,2]})
R = pl.DataFrame({'b': [3,4,5]})
L.join(R, left_on=pl.col('a') - pl.col('a'), right_on=pl.col('b') - pl.col('b')) # 6 lines as expected, full cross product
L.join(R, left_on=pl.lit(0), right_on=pl.lit(0)) # only 1 line, expected to be the same as previous

mcrumiller added bug Something isn't working python Related to Python Polars labels Jun 28, 2023

stinodego added the needs triage Awaiting prioritization by a maintainer label Jan 13, 2024

edavisau mentioned this issue Jan 26, 2024

join changes: broadcast left/right_on expressions + omit left_on/right_on expressions in result #14007

Draft

wence- mentioned this issue Jun 25, 2024

[QST/DOC]: Join key matching semantics for expression keys #17184

Open

mcrumiller mentioned this issue Aug 28, 2024

SELECT 0 from table_name return 1 line instead of expected number_of_lines(table_name) #18404

Closed

2 tasks

AlexeyDmitriev mentioned this issue Sep 11, 2024

Some functions do not accept arbitrary expressions. Should it be part of an API? #18695

Open

mcrumiller mentioned this issue Dec 27, 2024

fix: Fix join literal behavior #20477

Merged

ritchie46 closed this as completed in #20477 Dec 27, 2024

c-peters added the accepted Ready for implementation label Dec 29, 2024

c-peters assigned ritchie46 Dec 29, 2024

c-peters moved this to Done in Backlog Dec 29, 2024

c-peters added this to Backlog Dec 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`pl.lit` doesn't work properly as a join key #9603

`pl.lit` doesn't work properly as a join key #9603

mcrumiller commented Jun 28, 2023 •

edited

Loading

edavisau commented Jan 26, 2024

mcrumiller commented Jan 26, 2024

edavisau commented Jan 26, 2024

mcrumiller commented Jan 26, 2024 •

edited

Loading

edavisau commented Jan 26, 2024

mcrumiller commented Jul 7, 2024

AlexeyDmitriev commented Aug 28, 2024

pl.lit doesn't work properly as a join key #9603

pl.lit doesn't work properly as a join key #9603

Comments

mcrumiller commented Jun 28, 2023 • edited Loading

Issue description

Reproducible example

Expected behavior

Installed versions

edavisau commented Jan 26, 2024

mcrumiller commented Jan 26, 2024

edavisau commented Jan 26, 2024

mcrumiller commented Jan 26, 2024 • edited Loading

edavisau commented Jan 26, 2024

mcrumiller commented Jul 7, 2024

AlexeyDmitriev commented Aug 28, 2024

`pl.lit` doesn't work properly as a join key #9603

`pl.lit` doesn't work properly as a join key #9603

mcrumiller commented Jun 28, 2023 •

edited

Loading

mcrumiller commented Jan 26, 2024 •

edited

Loading