Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

pl.lit doesn't work properly as a join key #9603

Closed
mcrumiller opened this issue Jun 28, 2023 · 7 comments · Fixed by #20477
Closed

pl.lit doesn't work properly as a join key #9603

mcrumiller opened this issue Jun 28, 2023 · 7 comments · Fixed by #20477
Assignees
Labels
accepted Ready for implementation bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@mcrumiller
Copy link
Contributor

mcrumiller commented Jun 28, 2023

Update 2024-12-27: This still fails, but with a different error:

df1.join(df2, left_on=['a', pl.lit('b')], right_on=['a', 'b'], how="left")
# pyo3_runtime.PanicException: assertion `left == right` failed
#   left: 1
#  right: 4

Update: As per this comment the issue has changed but the title is still relevant. Here is new behavior:

import polars as pl

df1 = pl.DataFrame({
    'a': ['1', '2', '3', '4'],
})

df2 = pl.DataFrame({
    'a': ['1', '2', '3', '4'],
    'b': ['a', 'a', 'b', 'b'],
})

df1.join(df2, left_on=['a', pl.lit('b')], right_on=['a', 'b'], how="left")
# shape: (4, 3)
# ┌─────┬─────────┬──────┐
# │ a   ┆ a_right ┆ b    │
# │ --- ┆ ---     ┆ ---  │
# │ str ┆ str     ┆ str  │
# ╞═════╪═════════╪══════╡
# │ 1   ┆ null    ┆ null │
# │ 2   ┆ null    ┆ null │
# │ 3   ┆ null    ┆ null │  <-- should successfully join here
# │ 4   ┆ null    ┆ null │  <-- should successfully join here
# └─────┴─────────┴──────┘

Issue description

A pl.lit value apparently matches everything, regardless of value.

Reproducible example

import polars as pl

df1 = pl.DataFrame({
    'a': ['1', '2', '3', '4'],
})

df2 = pl.DataFrame({
    'a': ['1', '2', '3', '4'],
    'b': ['a', 'a', 'b', 'b'],
})

df1.join(
    df2,
    left_on=['a', pl.lit('b')],
    right_on=['a', 'b'],
    how="left",
)
shape: (4, 2)
┌─────┬─────────┐
│ a   ┆ literal │
│ --- ┆ ---     │
│ str ┆ str     │
╞═════╪═════════╡
│ 1   ┆ b       │
│ 2   ┆ b       │
│ 3   ┆ b       │
│ 4   ┆ b       │
└─────┴─────────┘

Expected behavior

First two records should be null.

Installed versions

--------Version info---------
Polars:      0.18.1
Index type:  UInt32
Platform:    Windows-10-10.0.19045-SP0
Python:      3.11.2 (tags/v3.11.2:878ead1, Feb  7 2023, 16:38:35) [MSC v.1934 64 bit (AMD64)]

----Optional dependencies----
numpy:       1.24.3
pandas:      2.0.0
pyarrow:     11.0.0
connectorx:  0.3.2a3
deltalake:   <not installed>
fsspec:      <not installed>
matplotlib:  3.7.1
xlsx2csv:    0.8.1
xlsxwriter:  3.1.0
@mcrumiller mcrumiller added bug Something isn't working python Related to Python Polars labels Jun 28, 2023
@stinodego stinodego added the needs triage Awaiting prioritization by a maintainer label Jan 13, 2024
@edavisau
Copy link
Contributor

I had a look into this, and I noticed that the description is slightly inaccurate. In this example, the pl.lit is matching nowhere.

The problem is that the literal doesn't broadcast when doing the join. Here is an illustrative example where only the first row is matched because the literal (effectively) expands to ["a", None, None, None] in the join implementation

import polars as pl

df1 = pl.DataFrame({
    'a': ['1', '2', '3', '4'],
})

df2 = pl.DataFrame({
    'a': ['1', '2', '3', '4'],
    'b': ['a', 'a', 'b', 'b'],
    'extra_col': [101, 102, 103, 104]
})


df1.join(
    df2,
    left_on=['a', pl.lit('a')],
    right_on=['a', 'b'],
    how="left",
)
shape: (4, 3)
┌─────┬─────────┬───────────┐
│ a   ┆ literal ┆ extra_col │
│ --- ┆ ---     ┆ ---       │
│ str ┆ str     ┆ i64       │
╞═════╪═════════╪═══════════╡
│ 1   ┆ a       ┆ 101       │
│ 2   ┆ a       ┆ null      │
│ 3   ┆ a       ┆ null      │
│ 4   ┆ a       ┆ null      │
└─────┴─────────┴───────────┘

@mcrumiller
Copy link
Contributor Author

@edavisau I do see the non-broadcasting issue here, good find. Not sure if #9621 can be simultaneously resolved but I do not think the literal column should be in the result set.

@edavisau
Copy link
Contributor

@mcrumiller I noticed this as well, to me it's a fundamental flaw with the current implementation of joins. For left joins, for example, polars effectively does

  • left_df.with_columns(left_on columns)
  • right_df.drop(right_on columns)

Which doesn't work well with "calculated columns"

In my mentioned PR above, the new behaviour would be

import polars as pl

df1 = pl.DataFrame({
    'a': ['1', '2', '3', '4'],
})

df2 = pl.DataFrame({
    'a': ['1', '2', '3', '4'],
    'b': ['a', 'a', 'b', 'b'],
    'extra_col': [101, 102, 103, 104]
})


df1.join(
    df2,
    left_on=['a', pl.lit('a')],
    right_on=['a', 'b'],
    how="left",
)
shape: (4, 3)
┌─────┬──────┬───────────┐
│ a   ┆ b    ┆ extra_col │
│ --- ┆ ---  ┆ ---       │
│ str ┆ str  ┆ i64       │
╞═════╪══════╪═══════════╡
│ 1   ┆ a    ┆ 101       │
│ 2   ┆ a    ┆ 102       │
│ 3   ┆ null ┆ null      │
│ 4   ┆ null ┆ null      │
└─────┴──────┴───────────┘

However, it's a big breaking change, and IMO should be decided simultaneously with issues like #13441 which is scheduled for 1.0 release.

@mcrumiller
Copy link
Contributor Author

mcrumiller commented Jan 26, 2024

It's similar to not coalescing, but it's not the same. If the join condition is itself a calculation, it shouldn't be included in the output. In SQL, for example, you can do:

SELECT
  A.*, B.*
FROM A
LEFT JOIN B ON
  A.value < B.value

After doing this your A.* doesn't include a column of True/False values indicating whether A.value < B.value. I don't believe this is intended in polars either, further evidenced by the fact that you cannot alias the join expression.

@edavisau
Copy link
Contributor

That makes sense, yep my updated condition was to check that the left_on and right_on are both not calculated expressions. To do this I compared the names and the pointer to the underlying data - here.

My point was that it all should be decided at once what the new "join behaviour" should be in 1.0. I know they are not the same issue but they are quite interdependent in my opinion.

@mcrumiller
Copy link
Contributor Author

Looks like this is now wrong but for a different reason, as of 1.0.1:

df1.join(df2, left_on=['a', pl.lit('b')], right_on=['a', 'b'], how="left")
# shape: (4, 3)
# ┌─────┬─────────┬──────┐
# │ a   ┆ a_right ┆ b    │
# │ --- ┆ ---     ┆ ---  │
# │ str ┆ str     ┆ str  │
# ╞═════╪═════════╪══════╡
# │ 1   ┆ null    ┆ null │
# │ 2   ┆ null    ┆ null │
# │ 3   ┆ null    ┆ null │  <-- should successfully join here
# │ 4   ┆ null    ┆ null │  <-- should successfully join here
# └─────┴─────────┴──────┘

@ritchie46 unsure if should create new issue.

@AlexeyDmitriev
Copy link

I have an example that is somewhat simpler in my opinion, which also returns strange result for original issue

L = pl.DataFrame({'a': [1,2]})
R = pl.DataFrame({'b': [3,4,5]})
L.join(R, left_on=pl.col('a') - pl.col('a'), right_on=pl.col('b') - pl.col('b')) # 6 lines as expected, full cross product
L.join(R, left_on=pl.lit(0), right_on=pl.lit(0)) # only 1 line, expected to be the same as previous

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
accepted Ready for implementation bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

6 participants