Difference in handling for trailing column separators between lazy and eager csv readers #8240

Closed
2 tasks done
Tshimanga opened this issue Apr 14, 2023 · 0 comments · Fixed by #20680
Labels
A-io Area: reading and writing data bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@Tshimanga

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

I seem to have stumbled on a difference in how read_csv and scan_csv tolerate trailing separators. The data I'm using to familiarize myself with Polars is the Unified Medical Language System (UMLS). The UMLS files are pipe-delimited, and every row ends with a trailing pipe. read_csv doesn't complain about this, but when I use the with_column_names argument of scan_csv I get an error for providing one too few column names.

thread '' panicked at 'assertion failed: (left == right)
left: 18,
right: 19: The length of the new names list should be equal to the original column length', src/lazy/dataframe.rs:262:21
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

Here's an example of the data:

C0006142|ENG|S|L0006142|PF|S0020508|N|A21143327||||MEDLINEPLUS|PT|3|Breast Cancer|0|N|256|
C0006142|ENG|S|L0006142|PF|S0020508|Y|A7756763||M0002909|D001943|MSH|PEP|D001943|Breast Cancer|0|N|256|
C0006142|ENG|S|L0006142|VC|S0415895|N|A0475029||||AOD|ET|0000004579|breast cancer|0|N|256|
C0006142|ENG|S|L0006142|VC|S0415895|N|A0475030||||BI|PT|BI00371|breast cancer|2|N|256|
C0006142|ENG|S|L0006142|VC|S0415895|N|A14014881||232253||MEDCIN|SY|232253|breast cancer|3|N|256|

And the column headers are:

CUI, LAT, TS, LUI, STT, SUI, ISPREF, AUI, SAUI, SCUI, SDUI, SAB, TTY, CODE, STR, SRL, SUPPRESS, and CVF

As a workaround, I add a placeholder column header to my list, but it seems odd that the lazy and eager CSV parsers lack parity.
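
The mismatch of 18 versus 19 in the panic message, and the placeholder workaround, can be illustrated without Polars. The sketch below uses only the standard library csv module to count the fields in a row that ends with a separator, then pads the name list to match. The helper names count_fields and pad_names and the "_trailing" placeholder are hypothetical, chosen for illustration; they are not part of Polars.

```python
import csv
import io

# Same shape as the repro file: four values followed by a trailing pipe.
SAMPLE = "A|B|C|D|\n"

# First row of the UMLS excerpt quoted above.
UMLS_ROW = (
    "C0006142|ENG|S|L0006142|PF|S0020508|N|A21143327||||"
    "MEDLINEPLUS|PT|3|Breast Cancer|0|N|256|\n"
)


def count_fields(line, separator="|"):
    """Return how many fields a CSV parser sees in a single line."""
    return len(next(csv.reader(io.StringIO(line), delimiter=separator)))


def pad_names(names, n_fields, placeholder="_trailing"):
    """Hypothetical helper: append placeholder names until there is one per field."""
    missing = n_fields - len(names)
    if missing < 0:
        raise ValueError("more names than fields")
    return list(names) + [f"{placeholder}_{i}" for i in range(missing)]


# The trailing pipe produces one extra, empty field.
n_fields = count_fields(SAMPLE)
names = pad_names(["A", "B", "C", "D"], n_fields)
print(n_fields)                # 5
print(names)                   # ['A', 'B', 'C', 'D', '_trailing_0']
print(count_fields(UMLS_ROW))  # 19, against the 18 documented headers
```

A padded list like this is what the workaround amounts to: it supplies one name per parsed field, so the lazy reader's length check is satisfied, and the placeholder column can be dropped afterwards.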

Reproducible example

import polars as pl
# Write sample data
with open("temp.csv", "w") as outfile:
    outfile.write("A|B|C|D|")

# load eagerly. this succeeds
eagerly = pl.read_csv(
    "temp.csv",
    separator="|",
    has_header=False,
    new_columns=["A", "B", "C", "D"]
)
print(len(eagerly.columns)) # returns 4

# load lazily. this throws an error!
lazily = pl.scan_csv(
    "temp.csv",
    separator="|",
    has_header=False,
    new_columns=["A", "B", "C", "D"]
).collect()
print(len(lazily.columns)) # never reached: the lazy load above errors!

Expected behavior

Given the same CSV, pl.read_csv and pl.scan_csv called with the same arguments should succeed or fail identically, and on success they should produce the same dataframe.
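
The expectation above can be phrased as a small parity check. This is a minimal sketch, assuming each loader is wrapped in a zero-argument callable that returns something directly comparable, such as the list of column names. The names outcome and same_outcome are hypothetical, and the two stubs stand in for the eager and lazy calls from the reproducible example so that the sketch runs without Polars. BaseException is caught because, as far as I know, Rust panics raised through PyO3 derive from BaseException rather than Exception.

```python
def outcome(load):
    """Run a loader and summarise it as ("ok", result) or ("error", exception type name)."""
    try:
        return ("ok", load())
    except (KeyboardInterrupt, SystemExit):
        raise
    except BaseException as exc:  # assumption: a PyO3 panic is not an Exception subclass
        return ("error", type(exc).__name__)


def same_outcome(eager, lazy):
    """True if both loaders fail, or both succeed with equal results."""
    a, b = outcome(eager), outcome(lazy)
    if a[0] != b[0]:
        return False
    return a[0] == "error" or a[1] == b[1]


# Stand-ins for the behaviour reported in this issue (not real Polars calls).
def eager_stub():
    return ["A", "B", "C", "D"]


def lazy_stub():
    raise ValueError("The length of the new names list should be equal to the original column length")


print(same_outcome(eager_stub, lazy_stub))   # False: the reported disparity
print(same_outcome(eager_stub, eager_stub))  # True: the expected parity
```

In practice the stubs would be replaced by callables that wrap pl.read_csv(...) and pl.scan_csv(...).collect() with identical arguments and return the resulting column names.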

Installed versions

---Version info---
Polars: 0.17.2
Index type: UInt32
Platform: Linux-6.0.2-76060002-generic-x86_64-with-glibc2.35
Python: 3.10.6 (main, Nov  2 2022, 18:53:38) [GCC 11.3.0]
---Optional dependencies---
numpy: <not installed>
pandas: <not installed>
pyarrow: <not installed>
connectorx: <not installed>
deltalake: <not installed>
fsspec: <not installed>
matplotlib: <not installed>
xlsx2csv: <not installed>
xlsxwriter: <not installed>
@Tshimanga Tshimanga added bug Something isn't working python Related to Python Polars labels Apr 14, 2023
@stinodego stinodego added needs triage Awaiting prioritization by a maintainer A-io Area: reading and writing data labels Jan 13, 2024