[BUG] read_csv fails to correctly handle misplaced quotes #2398

Open
ayushdg opened this issue Jul 25, 2019 · 5 comments
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code.

Comments

@ayushdg
Member

ayushdg commented Jul 25, 2019

Describe the bug
Often CSV files have misplaced quotes, and sometimes a quotation mark is simply part of one of the string fields. Such a mark should not be interpreted as an opening quote, i.e. one indicating that the field contains delimiters and is therefore wrapped in ".
Steps/Code to reproduce bug

import cudf
df = cudf.read_csv('quoting.csv')
df.to_pandas()
	a	b	c	d	e
0	1	192.16.1.2	/abc/def/ghi	200	1
1	2	nvidia.com	/abc/def/ghi.html	200	1
2	3	0.0.0.0	/images",500,1\n4,0.0.0.0,/abc/def,200,2\n5,ra...	-1	-1

Expected behavior

import pandas as pd
df = pd.read_csv('quoting.csv')
df
	a	b	c	d	e
0	1	192.16.1.2	/abc/def/ghi	200	1
1	2	nvidia.com	/abc/def/ghi.html	200	1
2	3	0.0.0.0	/images"	500	1
3	4	0.0.0.0	/abc/def	200	2
4	5	rapids.ai	/images	200	2

Workaround: Use cudf.read_csv with quoting=3. Pandas gives correct output for all quotation modes.
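
The effect of the workaround can be sketched without a GPU using Python's standard csv module, whose csv.QUOTE_NONE constant has the value 3 (the same value passed as quoting=3 above). The sample text is an assumption: it is reconstructed from the outputs shown in this report, since the contents of quoting.csv are only attached, not shown inline.

```python
import csv
import io

# Assumed contents of quoting.csv, reconstructed from the outputs above.
SAMPLE = (
    "a,b,c,d,e\n"
    "1,192.16.1.2,/abc/def/ghi,200,1\n"
    "2,nvidia.com,/abc/def/ghi.html,200,1\n"
    '3,0.0.0.0,/images",500,1\n'
    "4,0.0.0.0,/abc/def,200,2\n"
    "5,rapids.ai,/images,200,2\n"
)


def parse_without_quoting(text):
    """Parse CSV text, treating the quote character as ordinary data."""
    reader = csv.reader(io.StringIO(text), quoting=csv.QUOTE_NONE)
    return list(reader)


rows = parse_without_quoting(SAMPLE)
for row in rows:
    print(row)
```

With quoting disabled, the stray quote stays inside the third data row's c field and the two rows after it are no longer swallowed. The cuDF call from the workaround would be cudf.read_csv('quoting.csv', quoting=3).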

Environment overview (please complete the following information)

  • Environment location: Docker nightly
  • Method of cuDF install: Docker
    • docker pull rapidsai/rapidsai-nightly:0.9-cuda10.0-runtime-ubuntu16.04-gcc5-py3.7

Additional context
I might be wrong, but maybe the way to handle this is to check that an opening quote appears right after a delimiter and the closing quote right before the next delimiter? (Just a guess)

quoting.csv.zip

@ayushdg ayushdg added Needs Triage Need team to review and classify bug Something isn't working labels Jul 25, 2019
@randerzander randerzander added cuIO cuIO issue and removed Needs Triage Need team to review and classify labels Jul 25, 2019
@mjsamoht
Contributor

mjsamoht commented Aug 9, 2019

This is essentially the same problem as described in #873.

There isn't really a "correct" way of handling misplaced quotes. For every example where pandas seemingly returns something more useful, you can construct an example where pandas returns the "wrong" data.

We could decide for cuIO to match pandas' behavior, with all the good and bad side effects.

Please take a look at the following to learn more how pandas behaves:
#873 (comment)

@harrism
Member

harrism commented May 11, 2020

@kkraus14 @ayushdg @OlivierNV what do you think we should do with this issue? Should we aim to match Pandas, or not?

@kkraus14
Collaborator

The above situation seems like something that would be nice to handle, but in general we shouldn't try to handle every edge case / error case of Pandas here. We should do what's logical / what's the best end user experience.

Alternatively, instead of returning the correct results as above, if we were able to clearly and loudly error saying "hey there was an unclosed quotation on line 123 at character 87, try setting quoting=3" that would be perfectly good as well.
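
A loud error of that kind could look roughly like the following. This is only a host-side Python sketch under a simplified model in which every quote character toggles the quoted state; it is not how the cuDF reader works internally, and UnclosedQuoteError and check_quotes are hypothetical names, not cuDF or pandas APIs.

```python
class UnclosedQuoteError(ValueError):
    """Hypothetical error type; not part of cuDF or pandas."""

    def __init__(self, line, column):
        self.line = line
        self.column = column
        super().__init__(
            f"unclosed quotation on line {line} at character {column}; "
            "try setting quoting=3"
        )


def check_quotes(text, quotechar='"'):
    """Raise UnclosedQuoteError if a quote in `text` is never closed.

    Simplified model: each quote character toggles the quoted state,
    so an escaped pair ("") toggles twice and cancels out.
    Line and character positions are 1-based.
    """
    line, col = 1, 0
    opened_at = None  # (line, col) of the quote that is currently open
    for ch in text:
        col += 1
        if ch == quotechar:
            opened_at = None if opened_at else (line, col)
        elif ch == "\n":
            line, col = line + 1, 0
    if opened_at is not None:
        raise UnclosedQuoteError(*opened_at)


# Assumed sample with a stray quote at the end of the second line.
BAD = 'a,b,c\n1,2,/images"\n3,4,/abc\n'

try:
    check_quotes(BAD)
except UnclosedQuoteError as err:
    print(err)
```

The check reports where the dangling quote was opened rather than where the input ended, which is usually the position a user needs to fix.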

@OlivierNV
Contributor

OlivierNV commented May 11, 2020

IMO, we should try to match pandas if we can. The new csv implementation should be able to handle the case above when looking for \n, but the data conversion stage doesn't, so it will have to do something similar (basically automatically ignoring quoting if a field doesn't start with a quote)
[Edit] Although priority for malformed content should probably be lower than for correct cases, like basic escapechar support for the delimiter.
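
The rule of ignoring quoting when a field doesn't start with a quote can be sketched for a single record as follows. This is a plain Python illustration, not the libcudf implementation; split_fields is a hypothetical helper, and it assumes one record per call (no embedded newlines) and doubled quotes as the escape inside quoted fields.

```python
def split_fields(line, delimiter=",", quotechar='"'):
    """Split one CSV record; a quote only opens quoting at a field start."""
    fields = []
    buf = []
    in_quotes = False
    at_field_start = True
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quotes:
            if ch == quotechar:
                if i + 1 < len(line) and line[i + 1] == quotechar:
                    buf.append(quotechar)  # doubled quote: literal quote
                    i += 1
                else:
                    in_quotes = False  # closing quote
            else:
                buf.append(ch)  # delimiters inside quotes are data
        elif ch == quotechar and at_field_start:
            in_quotes = True  # quote right after a delimiter opens quoting
        elif ch == delimiter:
            fields.append("".join(buf))
            buf = []
            at_field_start = True
            i += 1
            continue
        else:
            buf.append(ch)  # includes quotes that appear mid-field
        at_field_start = False
        i += 1
    fields.append("".join(buf))
    return fields


print(split_fields('3,0.0.0.0,/images",500,1'))
print(split_fields('1,"a,b",2'))
```

As noted earlier in the thread, a rule like this helps with the stray quote in this report but can still produce surprising results on other malformed inputs.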

@vyasr
Contributor

vyasr commented Jul 22, 2022

As discussed in #873, these are edge cases where fixing them will break other (valid) use cases. The quoting=3 workaround is the correct solution. The only action item left before closing this issue is to determine whether it is feasible to throw a clear error for unclosed quotations, as Keith mentions above. @vuule I don't know much about the CSV reader internals; do you think that would be feasible?
