[FEA] csv_reader_options to read empty strings as blank (i.e. ""), not null. #12145

Open
mythrocks opened this issue Nov 15, 2022 · 3 comments
Labels: 0 - Backlog (In queue waiting for assignment), cuIO (cuIO issue), feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code)

Comments

@mythrocks
Contributor

When cudf::io::read_csv() encounters two consecutive field delimiters within a row, it treats the corresponding string column value as null. For example:

a,,c
d,,f

Reading the input above via read_csv() produces rows {a,null,c} and {d,null,f}. This is consistent with Spark's CSV reader (and presumably pandas).

It would be useful if the column value could optionally be interpreted as an empty string ("") instead. This would allow reading Hive delimited text, where an empty field is read as an empty string by default, not as null.
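To make the two interpretations concrete, here is a minimal host-side sketch in standard C++. It is not libcudf code: `split_row` and its `empty_as_null` flag are hypothetical names that only stand in for the requested option, and quoting and escaping are ignored.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// One parsed field: std::nullopt models a null, "" models an empty string.
using Field = std::optional<std::string>;

// Split one line on `delim` (no quoting or escaping in this sketch).
// `empty_as_null` is a hypothetical switch, not an existing libcudf option:
//   true  -> an empty field becomes null (the behavior described above)
//   false -> an empty field becomes ""   (the behavior requested here)
std::vector<Field> split_row(const std::string& line, bool empty_as_null, char delim = ',')
{
  std::vector<Field> fields;
  std::size_t start = 0;
  while (true) {
    const std::size_t end = line.find(delim, start);
    std::string token =
      line.substr(start, end == std::string::npos ? std::string::npos : end - start);
    if (token.empty() && empty_as_null) {
      fields.push_back(std::nullopt);
    } else {
      fields.push_back(std::move(token));
    }
    if (end == std::string::npos) { break; }
    start = end + 1;
  }
  return fields;
}
```

With `empty_as_null` set to true, `split_row("a,,c", true)` yields a null in the middle position; with false, the middle field is a valid empty string.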

@davidwendt
Contributor

I believe you can post-process this with cudf::replace_nulls, passing an empty string as the replacement: https://docs.rapids.ai/api/libcudf/stable/group__transformation__replace.html#gad359a898c2b11e70c3e33720259c5596

Technically you could also just remove the null mask. This assumes that null entries are created as empty strings with a corresponding validity bit set to 0.
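The two suggestions can be compared with a small host-side model in standard C++. This is only a sketch: `StringColumn`, `replace_nulls_model`, and `drop_null_mask_model` are hypothetical stand-ins, not libcudf types or functions, and the model stores one string per row instead of libcudf's actual column layout.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-in for a nullable strings column:
// chars[i] holds row i's characters, valid[i] is its validity bit.
struct StringColumn {
  std::vector<std::string> chars;
  std::vector<bool> valid;
};

// Model of replacing every null row with `replacement`.
StringColumn replace_nulls_model(StringColumn col, const std::string& replacement)
{
  for (std::size_t i = 0; i < col.chars.size(); ++i) {
    if (!col.valid[i]) {
      col.chars[i] = replacement;
      col.valid[i] = true;
    }
  }
  return col;
}

// Model of dropping the null mask: every row becomes valid and
// the stored characters are left untouched.
StringColumn drop_null_mask_model(StringColumn col)
{
  col.valid.assign(col.chars.size(), true);
  return col;
}
```

In this model the two approaches give the same column only when every null row already stores an empty string, which is the assumption stated above. If a null row held leftover characters, dropping the mask would expose them, while the replacement would not.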

@mythrocks
Contributor Author

mythrocks commented Nov 15, 2022

> you can post process this with cudf::replace_nulls with an empty string

The snag there is that we then can't differentiate between a legitimately null string and an empty one.
Suppose we set csv_reader_options.set_na_values({"\N"}), as is common with Spark/Hive. The second and third fields of the following input row cannot then be told apart in the resulting cuDF column:

First,,\N,Last

The Hive format would treat the second field as empty (""), and the third as null.
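A small host-side model in standard C++ shows why post-processing loses this distinction. It is a sketch under stated assumptions: `parse_row`, `EmptyField`, and `nulls_to_empty` are hypothetical names, not libcudf APIs, and the `na_values` parameter only mimics the effect of set_na_values as described in this thread.

```cpp
#include <cstddef>
#include <optional>
#include <set>
#include <string>
#include <utility>
#include <vector>

// std::nullopt models a null field, "" models an empty string.
using Field = std::optional<std::string>;

// How to treat a field with no characters between two delimiters.
enum class EmptyField { AsNull, AsEmptyString };

// Split one line on `delim`; tokens listed in `na_values` become null.
// No quoting or escaping is handled in this sketch.
std::vector<Field> parse_row(const std::string& line,
                             const std::set<std::string>& na_values,
                             EmptyField empty_field,
                             char delim = ',')
{
  std::vector<Field> fields;
  std::size_t start = 0;
  while (true) {
    const std::size_t end = line.find(delim, start);
    std::string token =
      line.substr(start, end == std::string::npos ? std::string::npos : end - start);
    const bool is_na         = na_values.count(token) > 0;
    const bool is_empty_null = token.empty() && empty_field == EmptyField::AsNull;
    if (is_na || is_empty_null) {
      fields.push_back(std::nullopt);
    } else {
      fields.push_back(std::move(token));
    }
    if (end == std::string::npos) { break; }
    start = end + 1;
  }
  return fields;
}

// Model of the suggested post-processing: every null becomes "".
std::vector<Field> nulls_to_empty(std::vector<Field> fields)
{
  for (auto& f : fields) {
    if (!f.has_value()) { f = std::string{}; }
  }
  return fields;
}
```

For the row above with `na_values` containing `\N` (written `"\\N"` in C++ source), `EmptyField::AsEmptyString` gives {First, "", null, Last}. `EmptyField::AsNull` gives {First, null, null, Last}, and `nulls_to_empty` then turns both middle fields into "", so the original difference cannot be recovered.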

@GregoryKimball added the "0 - Backlog" (In queue waiting for assignment) and "cuIO" (cuIO issue) labels and removed the "Needs Triage" (Need team to review and classify) label on Nov 19, 2022
@vuule
Contributor

vuule commented Dec 7, 2022

This is surprising; I would expect set_na_values to affect this behavior. Checking whether this is a bug.
