pl.lit doesn't work properly as a join key #9603
Comments
I had a look into this and noticed that the description is slightly inaccurate. In this example, the pl.lit is matching nowhere. The problem is that the literal doesn't broadcast when doing the join. Here is an illustrative example where only the first row is matched, because the literal is not broadcast to the full column length:
import polars as pl
df1 = pl.DataFrame({
'a': ['1', '2', '3', '4'],
})
df2 = pl.DataFrame({
'a': ['1', '2', '3', '4'],
'b': ['a', 'a', 'b', 'b'],
'extra_col': [101, 102, 103, 104]
})
df1.join(
df2,
left_on=['a', pl.lit('a')],
right_on=['a', 'b'],
how="left",
)
@mcrumiller I noticed this as well; to me, it's a fundamental flaw in the current implementation of joins. For left joins, for example, polars effectively does
Which doesn't work well with "calculated columns". In the PR I mentioned above, the new behaviour would be:
import polars as pl
df1 = pl.DataFrame({
'a': ['1', '2', '3', '4'],
})
df2 = pl.DataFrame({
'a': ['1', '2', '3', '4'],
'b': ['a', 'a', 'b', 'b'],
'extra_col': [101, 102, 103, 104]
})
df1.join(
df2,
left_on=['a', pl.lit('a')],
right_on=['a', 'b'],
how="left",
)
However, it's a big breaking change and, IMO, should be decided together with issues like #13441, which is scheduled for the 1.0 release.
It's similar to not coalescing, but it's not the same. If the join condition is itself a calculation, it shouldn't be included in the output. In SQL, for example, you can do:
After doing this your
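To illustrate the SQL point above (a calculated join condition that takes part in matching but does not appear in the output), here is a rough sketch using Python's built-in sqlite3 module. The tables mirror the df1/df2 example from earlier in the thread; the query itself is an assumption written for illustration and is not the snippet from the original comment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE df1 (a TEXT);
    CREATE TABLE df2 (a TEXT, b TEXT, extra_col INTEGER);
    INSERT INTO df1 VALUES ('1'), ('2'), ('3'), ('4');
    INSERT INTO df2 VALUES ('1', 'a', 101), ('2', 'a', 102),
                           ('3', 'b', 103), ('4', 'b', 104);
""")

# The literal 'a' is part of the join condition but is not selected,
# so it never shows up as an output column.
rows = conn.execute("""
    SELECT df1.a, df2.extra_col
    FROM df1
    LEFT JOIN df2 ON df1.a = df2.a AND df2.b = 'a'
    ORDER BY df1.a
""").fetchall()
print(rows)
```

Rows of df1 without a matching df2 row under the full condition are kept with a NULL extra_col, which is the left-join behaviour the thread expects from the polars example.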
That makes sense. Yep, my updated condition checks that neither left_on nor right_on is a calculated expression. To do this, I compared the names and the pointers to the underlying data (here). My point was that the new "join behaviour" for 1.0 should all be decided at once. I know they are not the same issue, but they are quite interdependent in my opinion.
Looks like this is now wrong, but for a different reason, as of 1.0.1:
df1.join(df2, left_on=['a', pl.lit('b')], right_on=['a', 'b'], how="left")
# shape: (4, 3)
# ┌─────┬─────────┬──────┐
# │ a ┆ a_right ┆ b │
# │ --- ┆ --- ┆ --- │
# │ str ┆ str ┆ str │
# ╞═════╪═════════╪══════╡
# │ 1 ┆ null ┆ null │
# │ 2 ┆ null ┆ null │
# │ 3 ┆ null ┆ null │ <-- should successfully join here
# │ 4 ┆ null ┆ null │ <-- should successfully join here
# └─────┴─────────┴──────┘
@ritchie46 unsure if I should create a new issue.
I have an example that is somewhat simpler, in my opinion, and that also returns a strange result for the original issue:
L = pl.DataFrame({'a': [1,2]})
R = pl.DataFrame({'b': [3,4,5]})
L.join(R, left_on=pl.col('a') - pl.col('a'), right_on=pl.col('b') - pl.col('b')) # 6 rows as expected, full cross product
L.join(R, left_on=pl.lit(0), right_on=pl.lit(0)) # only 1 row, expected to be the same as the previous join
Update 2024-12-27: This still fails, but with a different error:
Update: As per this comment, the issue has changed, but the title is still relevant. Here is the new behavior:
Issue description
A pl.lit value apparently matches everything, regardless of value.
Reproducible example
Expected behavior
First two records should be null.
Installed versions