Inconsistent .unique() result for pl.Enum type depending on input data format #19338

Closed
2 tasks done
Sage0614 opened this issue Oct 21, 2024 · 1 comment · Fixed by #20680
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@Sage0614

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

enum_type = pl.Enum(["A", "B", "C"])
df = pl.DataFrame(
    data={
        "enum": "A",
    },
    schema_overrides={"enum": enum_type},
)
df2 = pl.DataFrame(
    data=[
        {
            "enum": "A",
        }
    ],
    schema_overrides={"enum": enum_type},
)
print(df)
print("df:")
print(df.select(pl.col("enum").unique()))
print("df2:")
print(df2.select(pl.col("enum").unique()))
pl.show_versions()

Log output

shape: (1, 1)
┌──────┐
│ enum │
│ ---  │
│ enum │
╞══════╡
│ A    │
└──────┘
df:
shape: (1, 1)
┌──────┐
│ enum │
│ ---  │
│ enum │
╞══════╡
│ A    │
└──────┘
df2:
shape: (3, 1)
┌──────┐
│ enum │
│ ---  │
│ enum │
╞══════╡
│ A    │
│ B    │
│ C    │
└──────┘

Issue description

df and df2 hold the same data but are initialized from different input formats (a dict of columns versus a list of row dicts), both of which I believe are valid. When the frame is initialized the way df2 is, .unique() returns an incorrect result. This only happens with the pl.Enum type, not with pl.String.

Expected behavior

Calling .unique() on both df and df2 should return only 'A'.

Installed versions

--------Version info---------
Polars:              1.10.0
Index type:          UInt32
Platform:            Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python:              3.11.4 (main, Aug 24 2023, 11:18:03) [GCC 11.4.0]
LTS CPU:             False

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               2023.6.0
gevent               <not installed>
great_tables         <not installed>
matplotlib           3.7.2
nest_asyncio         1.5.7
numpy                1.25.2
openpyxl             <not installed>
pandas               2.0.3
pyarrow              13.0.0
pydantic             1.10.12
pyiceberg            <not installed>
sqlalchemy           2.0.20
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
@Sage0614 Sage0614 added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Oct 21, 2024
@cmdlineluser
Contributor

Can reproduce.

pl.DataFrame(data=[{"enum": "A"}], schema={"enum": pl.Enum(["A", "B", "C"])}).unique()
# shape: (3, 1)
# ┌──────┐
# │ enum │
# │ ---  │
# │ enum │
# ╞══════╡
# │ A    │
# │ null │
# │ null │
# └──────┘

It seems this behaviour was introduced in 1.8.2
