
FR: Allow duplicate column names in pandas.read_csv #19383


Closed
njvack opened this issue Jan 24, 2018 · 5 comments
Labels
Duplicate Report Duplicate issue or pull request

Comments

@njvack

njvack commented Jan 24, 2018

Right now, pandas's read_csv() supports forcing column names read from CSV data to be unique:

>>> import pandas as pd
>>> pd.__version__
u'0.20.3'
>>> from StringIO import StringIO
>>> csv_data = """a,a,b
... 1,2,3
... 4,5,6"""
>>> df = pd.read_csv(StringIO(csv_data))
>>> df.columns.tolist()
['a', 'a.1', 'b']

The documentation suggests that passing mangle_dupe_cols=False to read_csv() will change this behavior to one where it'll overwrite data on load. That doesn't seem to be implemented as of this version:

>>> df = pd.read_csv(StringIO(csv_data), mangle_dupe_cols=False)
# snip
ValueError: Setting mangle_dupe_cols=False is not supported yet

However! Pandas doesn't fundamentally disallow duplicate column names. There's a simple (but somewhat convoluted) trick to get duplicate columns from a CSV file into your headers:

>>> df = pd.read_csv(StringIO(csv_data), header=None)
>>> df.columns = df.iloc[0]
>>> df = df.reindex(df.index.drop(0)).reset_index(drop=True)
>>> df
0  a  a  b
0  1  2  3
1  4  5  6

(Then again, maybe this isn't so simple; there's a funny "0" in there with the column headings and that seems weird.)
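The stray "0" is the name of the columns index: `df.iloc[0]` is a Series whose name is its row label, 0, and that name carries over when the Series is assigned to `df.columns`. A minimal Python 3 sketch of the same trick that avoids it by converting the row to a plain list first (the variable name `raw` is only illustrative):

```python
from io import StringIO

import pandas as pd

csv_data = "a,a,b\n1,2,3\n4,5,6"

# Read every row, header included, as data so that nothing is renamed.
raw = pd.read_csv(StringIO(csv_data), header=None)

# Drop the header row from the data, then promote it to column labels.
# tolist() discards the Series name, so the columns index stays unnamed.
df = raw.iloc[1:].reset_index(drop=True)
df.columns = raw.iloc[0].tolist()

print(df.columns.tolist())  # ['a', 'a', 'b']
```

Because the header line was parsed as data, every value comes back as a string ('1', not 1), so numeric columns would need converting afterwards.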

Problem description

I would like a native way to read CSV files with repeated headers. In my application, it's literally so I can warn people about duplicated column headers. Yes, I could use Python's built-in csv module for this, but then I'm using two methods to read CSV files and it gets weird.

Since mangle_dupe_cols=False is not yet implemented, I'd propose that it keep duplicate column names exactly as they appear in the file.
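For reference, the csv-module alternative mentioned above could look roughly like the sketch below. The function name `duplicated_headers` is hypothetical, and the sketch assumes the header is the first row of the file.

```python
import csv
from collections import Counter
from io import StringIO


def duplicated_headers(fileobj):
    """Return the header names that occur more than once (hypothetical helper)."""
    # Only the first row is consumed; the rest of the file is not read.
    header = next(csv.reader(fileobj))
    counts = Counter(header)
    return [name for name, n in counts.items() if n > 1]


dupes = duplicated_headers(StringIO("a,a,b\n1,2,3\n4,5,6\n"))
print(dupes)  # ['a']
```

This does mean the file is opened twice, once by csv and once by pandas, which is the awkwardness described above.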

Output of pd.show_versions()

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 17.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.20.3
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.2.7
Cython: None
numpy: 1.13.1
scipy: None
xarray: None
IPython: 5.4.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@njvack
Author

njvack commented Jan 25, 2018

Probably a slightly better "turn the first row into column headers" incantation:

df = df.rename(columns=df.iloc[0], copy=False).iloc[1:].reset_index(drop=True)

Also, I've upgraded to pandas 0.22 and all this works in that version as well.
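One side effect of promoting the first row is that the header text is parsed together with the data, so every column ends up holding strings. A sketch of a two-pass variant that keeps dtype inference, assuming the file can be read twice and the header is on the first line (`read_csv_keep_duplicates` is a hypothetical helper, not a pandas function):

```python
import os
import tempfile

import pandas as pd


def read_csv_keep_duplicates(path, **kwargs):
    """Read a CSV without renaming repeated headers (hypothetical helper)."""
    # Pass 1: read only the header line, as data, so it is not mangled.
    header = pd.read_csv(path, header=None, nrows=1, dtype=str).iloc[0].tolist()
    # Pass 2: skip the header line and let pandas infer dtypes from the data rows.
    df = pd.read_csv(path, header=None, skiprows=1, **kwargs)
    df.columns = header
    return df


# Demo with a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=False) as tmp:
    tmp.write("a,a,b\n1,2,3\n4,5,6\n")
try:
    df = read_csv_keep_duplicates(tmp.name)
finally:
    os.remove(tmp.name)

print(df.columns.tolist())  # ['a', 'a', 'b']
```

Here the data rows are parsed on their own, so the three columns are inferred as integers rather than strings.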

njvack added a commit to uwmadison-chm/masterfile that referenced this issue Jan 25, 2018
Checks to see if any columns (other than the id column) are duplicated,
either in one file or across files.

The behind-the-scenes change that *could* have repercussions is that
this changes how we're reading the CSV files into dataframes. pandas
mangles duplicated column names when reading CSV files; however, we can
get around this by having pandas not interpret the header row and
instead, read it as a normal row and then use it as column headers.

Yes, this is silly. See pandas-dev/pandas#19383
@chris-b1
Contributor

Thanks for the report, this is a duplicate of #13262. A PR to support this would be welcome if you're interested!

@chris-b1 chris-b1 added the Duplicate Report Duplicate issue or pull request label Jan 25, 2018
@chris-b1 chris-b1 added this to the No action milestone Jan 25, 2018
@njvack
Author

njvack commented Jan 25, 2018

Thanks. I suspected I wasn't the first to report this, but somehow failed to find that issue. I'll try to work on a PR for this (it shouldn't be too hard?) once I've figured out how the pandas source works; this is a pretty intimidating codebase...

@grofte

grofte commented Apr 8, 2018

Since pandas does support non-unique column names, it would be really great if pandas had some kind of function to warn about them, maybe in df.info() and/or other functions commonly used when running into trouble.

Example of a crash: trying to use a seaborn boxplot on a dataframe with duplicate column names. The error is raised inside pandas, but I don't know what actually causes it.
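Until something like that exists in pandas itself, a check along these lines can be written in user code. This is only a sketch: `warn_on_duplicate_columns` is a hypothetical helper, built on the existing `Index.duplicated()` method.

```python
import warnings

import pandas as pd


def warn_on_duplicate_columns(df):
    """Warn about repeated column names and return them (hypothetical helper)."""
    # duplicated() marks the second and later occurrences of each label.
    dupes = df.columns[df.columns.duplicated()].unique().tolist()
    if dupes:
        warnings.warn("duplicate column names: %s" % dupes)
    return dupes


df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "a", "b"])
dupes = warn_on_duplicate_columns(df)
print(dupes)  # ['a']
```

Calling it before handing a dataframe to a plotting library would surface the problem earlier than a crash deep inside pandas.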

@claresloggett

While this doesn't directly address the issue, for the specific use-case that @njvack is encountering - wanting to warn the user about duplicate columns - I've resorted to a pre-load of the header row with

pd.read_csv(myfile, header=None, nrows=1).iloc[0,:].value_counts()

and then any values with count > 1 can be reported to the user.

As a general issue though, mangle_dupe_cols still needs implementation by the looks of it (as of Pandas version 1.4.1).
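Spelled out as a complete, self-contained sketch, the pre-load check above might look like this (the sample data and variable names are illustrative, and an in-memory buffer stands in for `myfile`):

```python
from io import StringIO

import pandas as pd

csv_data = "a,a,b\n1,2,3\n4,5,6"

# Read only the header line, as data, so duplicate names are left untouched.
header_counts = (
    pd.read_csv(StringIO(csv_data), header=None, nrows=1).iloc[0, :].value_counts()
)

# Any header that appears more than once can be reported to the user.
duplicated = header_counts[header_counts > 1].index.tolist()
print(duplicated)  # ['a']
```

The actual data can then be loaded with a normal `read_csv()` call, accepting that pandas will rename the repeated columns.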
