Difference in handling for trailing column separators between lazy and eager csv readers #8240
Closed
Labels
A-io
Area: reading and writing data
bug
Something isn't working
needs triage
Awaiting prioritization by a maintainer
python
Related to Python Polars
Polars version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.
Issue description
I seem to have stumbled on a difference in tolerance between read_csv and scan_csv. The data I'm using to familiarize myself with Polars is the Unified Medical Language System (UMLS). The UMLS files are pipe-delimited, but every line ends with a trailing pipe. read_csv doesn't complain about this, but when I use the with_column_names argument of scan_csv I get an error for providing one too few column names.
thread '' panicked at 'assertion failed: `(left == right)`
  left: `18`,
 right: `19`: The length of the new names list should be equal to the original column length', src/lazy/dataframe.rs:262:21
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Here's an example of the data:
C0006142|ENG|S|L0006142|PF|S0020508|N|A21143327||||MEDLINEPLUS|PT|3|Breast Cancer|0|N|256|
C0006142|ENG|S|L0006142|PF|S0020508|Y|A7756763||M0002909|D001943|MSH|PEP|D001943|Breast Cancer|0|N|256|
C0006142|ENG|S|L0006142|VC|S0415895|N|A0475029||||AOD|ET|0000004579|breast cancer|0|N|256|
C0006142|ENG|S|L0006142|VC|S0415895|N|A0475030||||BI|PT|BI00371|breast cancer|2|N|256|
C0006142|ENG|S|L0006142|VC|S0415895|N|A14014881||232253||MEDCIN|SY|232253|breast cancer|3|N|256|
And the column headers are:
CUI,LAT,TS,LUI,STT,SUI,ISPREF,AUI,SAUI,SCUI,SDUI,SAB,TTY,CODE,STR,SRL,SUPPRESS, and CVF
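To make the 18 vs. 19 mismatch concrete, here is a minimal sketch that uses only Python's standard csv module, not Polars. It splits one of the sample rows above on the pipe and compares the field count with the header list. The variable names are mine, and this only illustrates how a trailing separator produces an extra empty field; it says nothing about how either Polars reader is implemented.

```python
import csv
import io

# One row copied from the sample data above; note the trailing pipe.
sample = (
    "C0006142|ENG|S|L0006142|PF|S0020508|N|A21143327||||"
    "MEDLINEPLUS|PT|3|Breast Cancer|0|N|256|\n"
)

# The 18 documented column names listed above.
names = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]

# Split the row on the pipe delimiter.
row = next(csv.reader(io.StringIO(sample), delimiter="|"))

print(len(names))     # 18 names supplied
print(len(row))       # 19 fields parsed
print(repr(row[-1]))  # '' (the empty field created by the trailing pipe)
```

The last field is an empty string, so a parser that counts it as a column sees 19 columns where only 18 names were supplied.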
I have a workaround, adding a placeholder column header to my list, but it seems odd that the lazy and eager CSV parsers lack parity.
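For anyone hitting the same thing, the placeholder workaround can be sketched as a small renaming helper. This is plain Python: pad_names and the "_trailing" placeholder are hypothetical names of mine, not part of Polars. I'm assuming the callable given to with_column_names receives the list of original column names and must return a list of the same length, which is what the assertion message suggests.

```python
from typing import Callable, List

UMLS_NAMES = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]


def pad_names(
    names: List[str], placeholder: str = "_trailing"
) -> Callable[[List[str]], List[str]]:
    """Build a renaming function that pads `names` with placeholder
    entries so the result always matches the parsed column count."""

    def rename(original: List[str]) -> List[str]:
        extra = len(original) - len(names)
        if extra < 0:
            # More names than parsed columns: padding cannot fix that.
            raise ValueError(
                f"got {len(names)} names for {len(original)} columns"
            )
        # Append one placeholder per surplus column (here: the trailing pipe).
        return list(names) + [f"{placeholder}_{i}" for i in range(extra)]

    return rename


# Simulate the 19 auto-generated names a reader might hand to the callback.
parsed = [f"column_{i}" for i in range(1, 20)]
print(pad_names(UMLS_NAMES)(parsed)[-2:])
```

The idea would be to pass pad_names(UMLS_NAMES) as the with_column_names argument and drop the placeholder column afterwards. I have only checked the helper on its own, not against every Polars version.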
Reproducible example
Expected behavior
Given the same CSV, pl.read_csv and pl.scan_csv with the same arguments should succeed or fail identically, producing the same resulting dataframe on success.
Installed versions