BUG: read multi-index column csv with index_col=False borks #6051


Closed
jreback opened this issue Jan 23, 2014 · 21 comments · Fixed by #30327
Labels
good first issue IO CSV read_csv, to_csv Needs Tests Unit test(s) needed to prevent regressions
Milestone

Comments

@jreback
Contributor

jreback commented Jan 23, 2014

http://stackoverflow.com/questions/21318865/read-multi-index-on-the-columns-from-csv-file

@hayd
Contributor

hayd commented Jan 23, 2014

For convenience, here's a test case:

import pandas as pd
from io import StringIO  # on Python 2: from StringIO import StringIO
s1 = '''Male, Male, Male, Female, Female
R, R, L, R, R
.86, .67, .88, .78, .81'''

s2 = '''Male, Male, Male, Female, Female
R, R, L, R, R
.86, .67, .88, .78, .81
.86, .67, .88, .78, .82'''

In [11]: pd.read_csv(StringIO(s1), header=[0, 1])
Out[11]: 
Empty DataFrame
Columns: [(Male, R), ( Male,  R), ( Male,  L), ( Female,  R), ( Female,  R)]
Index: []

In [12]: pd.read_csv(StringIO(s2), header=[0, 1])
Out[12]: 
   (Male, R)  ( Male,  R)  ( Male,  L)  ( Female,  R)  ( Female,  R)
0       0.86         0.67         0.88           0.82           0.82

It seems to skip the first row after the header.

Note: the columns are tupleized because I wanted to see whether this was also a bug in 0.12.

@jreback
Contributor Author

jreback commented Jan 23, 2014

@hayd could you try tupleize_cols=True and see if it works?

@TomAugspurger
Contributor

Mine is skipping all the subsequent rows (s1 and s2 are from Andy's example):

In [5]: pd.read_csv(StringIO(s1), header=[0, 1])
Out[5]: 
Empty DataFrame
Columns: [(Male, R), ( Male,  R), ( Male,  L), ( Female,  R), ( Female,  R)]
Index: []

[0 rows x 5 columns]

In [6]: pd.read_csv(StringIO(s2), header=[0, 1])
Out[6]: 
Empty DataFrame
Columns: [(Male, R), ( Male,  R), ( Male,  L), ( Female,  R), ( Female,  R)]
Index: []

[0 rows x 5 columns]

@hayd
Contributor

hayd commented Jan 23, 2014

@jreback maybe I was just being thick about tupleized columns (I forgot the repr of a MultiIndex); that part is working fine, though it's off-topic.

There is a change between 0.12 and 0.13. I see what @TomAugspurger sees in 0.13.

@jreback
Contributor Author

jreback commented Jan 23, 2014

The problem, I think, is that it is confused by the lack of an index_col; specifying index_col=0 actually works (but kills the first value...)

@hayd
Contributor

hayd commented Jan 23, 2014

Seems like the row after the header is being used to name the index?

In [11]: pd.read_csv(StringIO(s1), header=[0, 1], index_col=0)
Out[11]: 
Empty DataFrame
Columns: [( Male,  R), ( Male,  L), ( Female,  R), ( Female,  R)]
Index: []

In [12]: pd.read_csv(StringIO(s2), header=[0, 1], index_col=0)
Out[12]: 
      ( Male,  R)  ( Male,  L)  ( Female,  R)  ( Female,  R)
.86                                                         
0.86         0.67         0.88           0.82           0.82

@jreback
Contributor Author

jreback commented Jan 23, 2014

@hayd yes, if it can, but this is where index_col matters; it is a heuristic (and maybe wrong in this case)

@waitingkuo
Contributor

It seems the problem is caused by the duplicated columns ( Female, R). If you change the second row to a, b, c, d, e, the function works normally. Is it a bug? Or should we raise an exception when there are duplicated multi-index columns?

@jreback
Contributor Author

jreback commented Jan 24, 2014

@waitingkuo hmm, a duplicated multi-index is technically valid (probably not tested very well, though)
I think this may be related to index_col - it basically has to try to guess whether there are names present or not

want to dig in?

@waitingkuo
Contributor

For duplicated single-level columns, sequence numbers are appended:

In [13]: pd.read_csv(StringIO('R,R,L,R,R\n1,2,3,4,5'))
Out[13]: 
   R  R.1  L  R.2  R.3
0  1    2  3    4    5

[1 rows x 5 columns]

According to this logic, the multi-column one

Male,Male,Male,Female,Female
R,R,L,R,R

should be converted to

Male,Male,Male,Female,Female
R,R.1,L,R,R.1

Does it make sense?
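
The renaming rule proposed above can be sketched in plain Python. This is only an illustration of the proposal, not pandas's parser code: the helper name `mangle_dupe_columns` is hypothetical, and it ignores the corner case where a generated name such as `R.1` already exists in the header.

```python
from collections import defaultdict


def mangle_dupe_columns(columns):
    """Append .1, .2, ... to repeated labels (hypothetical helper).

    Plain strings get the suffix directly; tuples (multi-level headers)
    get the suffix on their last level only.
    """
    counts = defaultdict(int)
    result = []
    for col in columns:
        seen = counts[col]  # how many times this label appeared before
        counts[col] += 1
        if seen == 0:
            result.append(col)
        elif isinstance(col, tuple):
            result.append(col[:-1] + (f"{col[-1]}.{seen}",))
        else:
            result.append(f"{col}.{seen}")
    return result


single = mangle_dupe_columns(["R", "R", "L", "R", "R"])
multi = mangle_dupe_columns(
    [("Male", "R"), ("Male", "R"), ("Male", "L"), ("Female", "R"), ("Female", "R")]
)
print(single)
print(multi)
```

Under this sketch the single-level header becomes R, R.1, L, R.2, R.3, and the second level of the two-level header becomes R, R.1, L, R, R.1, because duplicates are counted per full (level 0, level 1) tuple.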

@hayd
Contributor

hayd commented Jan 28, 2014

That looks correct. However there is also a flag for this, mangle_dupe_cols:

In [7]: pd.read_csv(StringIO('R,R,L,R,R\n1,2,3,4,5'), mangle_dupe_cols=False)
Out[7]: 
   R  R  L  R  R
0  5  5  3  5  5

[1 rows x 5 columns]

@hayd
Contributor

hayd commented Jan 28, 2014

Well... er that's a bug!
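
One plausible way to end up with 5 5 3 5 5 is that the parsed values are stored in a mapping keyed by column name, so later duplicates overwrite earlier ones. The thread does not confirm that this is what the parser did; the snippet below is a plain-Python illustration under that assumption, and every name in it is made up for the example.

```python
names = ["R", "R", "L", "R", "R"]
values = [1, 2, 3, 4, 5]

# Assumption for illustration: data is keyed by column name, so each
# repeated "R" replaces the value stored by the previous "R".
by_name = {}
for name, value in zip(names, values):
    by_name[name] = value

# Reading the row back by name returns the last value for every "R".
row = [by_name[name] for name in names]
print(row)
```

With unique (mangled) names the mapping would hold five entries and nothing would be overwritten, which is consistent with the default output shown earlier.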

@waitingkuo
Contributor

Things also go wrong when we pass header as a list:

In [4]: pd.read_csv(StringIO('R,R,L,R,R\n1,2,3,4,5'), header=[0])
Out[4]:  
Empty DataFrame
Columns: [R, R, L, R, R]
Index: []

[0 rows x 5 columns]

@waitingkuo
Contributor

I've figured out the problem and fixed it in Python 2. However, I got stuck in Python 3. Can anyone who has experience with Python 3 give me a hand?

My commit
waitingkuo@b969e96

My Travis Failed build
https://travis-ci.org/waitingkuo/pandas/jobs/17788195

@jreback
Contributor Author

jreback commented Jan 29, 2014

use lzip instead of zip
it's imported from pandas.compat
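
For context, the usual reason this matters: in Python 2 `zip` returns a list, while in Python 3 it returns a one-shot iterator that cannot be indexed or measured with `len`. The sketch below assumes `lzip` behaves like `list(zip(...))` and defines a local stand-in rather than importing it, since `pandas.compat` may not provide it in every version. The header rows are made-up sample data; the code in the linked commit is not shown in this thread.

```python
def lzip(*iterables):
    """Local stand-in (assumption): zip that always returns a list."""
    return list(zip(*iterables))


header_rows = [["Male", "Male", "Female"], ["R", "L", "R"]]

# Python 3: zip gives an iterator, so indexing it raises TypeError.
lazy = zip(*header_rows)
try:
    lazy[0]
    indexable = True
except TypeError:
    indexable = False

# lzip gives a real list of tuples that can be indexed and reused.
columns = lzip(*header_rows)
print(indexable, columns[0], len(columns))
```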

@waitingkuo
Contributor

Thank you for helping :)
I've made the pull request

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Apr 9, 2014
@jreback jreback modified the milestones: 0.14.1, 0.15.0 May 30, 2014
@jreback jreback modified the milestones: 0.15.0, 0.14.1, 0.15.1 Jun 30, 2014
@jreback jreback modified the milestones: 0.15.1, 0.15.0 Sep 4, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@jreback jreback removed this from the 0.16.0 milestone Mar 6, 2015
@Licht-T
Contributor

Licht-T commented Nov 9, 2017

@jreback @gfyoung This seems fixed. We need to close this issue.

@jreback
Contributor Author

jreback commented Nov 9, 2017

are there tests covering this case? if not can u put one up

@Licht-T
Contributor

Licht-T commented Nov 9, 2017

@jreback Okay. I'll check.

@Licht-T
Contributor

Licht-T commented Nov 10, 2017

@jreback Seems that #17060 fixed this bug. But there is no test for multi-index columns.
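
A regression check for the multi-index column case might look roughly like the sketch below. It reuses the s1 and s2 inputs from the test case earlier in this thread. The function name is hypothetical, and the assertions assume the fixed behaviour (every data row after the two header rows is kept), not the 0.13 behaviour reported above.

```python
from io import StringIO

import pandas as pd

S1 = (
    "Male, Male, Male, Female, Female\n"
    "R, R, L, R, R\n"
    ".86, .67, .88, .78, .81"
)
S2 = S1 + "\n.86, .67, .88, .78, .82"


def check_multiindex_header_keeps_rows():
    """Hypothetical check: no data row is lost after a two-row header."""
    df1 = pd.read_csv(StringIO(S1), header=[0, 1])
    df2 = pd.read_csv(StringIO(S2), header=[0, 1])

    # Two header rows should produce two column levels.
    assert isinstance(df1.columns, pd.MultiIndex)
    assert df1.columns.nlevels == 2

    # One data row in S1, two in S2, five columns in both.
    assert df1.shape == (1, 5)
    assert df2.shape == (2, 5)
    return df1, df2


df1, df2 = check_multiindex_header_keeps_rows()
print(df2)
```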

@jreback
Contributor Author

jreback commented Nov 10, 2017

cc @gfyoung

@mroeschke mroeschke added Testing pandas testing functions or related to the test suite good first issue and removed Bug labels Jul 6, 2018
@mroeschke mroeschke added Needs Tests Unit test(s) needed to prevent regressions and removed Testing pandas testing functions or related to the test suite labels Oct 6, 2019
@jreback jreback modified the milestones: Contributions Welcome, 1.0 Dec 20, 2019
6 participants