FR: Allow duplicate column names in pandas.read_csv #19383
Comments
Probably a slightly better "turn the first row into column headers" incantation: `df = df.rename(columns=df.iloc[0], copy=False).iloc[1:].reset_index(drop=True)`. Also, I've upgraded to pandas 0.22 and all this works in that version as well.
Checks to see if any columns (other than the id column) are duplicated, either in one file or across files. The behind-the-scenes change that *could* have repercussions is how we read the CSV files into dataframes. pandas mangles duplicated column names when reading CSV files; however, we can get around this by having pandas not interpret the header row, reading it as a normal row instead, and then using it as the column headers. Yes, this is silly. See pandas-dev/pandas#19383
Thanks for the report, this is a duplicate of #13262. A PR to support this would be welcome if you're interested!
Thanks. I suspected I wasn't the first to report this, but somehow failed to find that issue. I'll try to work on a PR for this (it shouldn't be too hard?) once I can figure out how pandas's source works; this is a pretty intimidating codebase...
Since pandas does support non-unique column names, it would be really great if pandas had some kind of function to warn about them, maybe in `df.info()` and/or other functions commonly used when running into trouble. Example of a crash: using seaborn's boxplot on a dataframe with duplicate column names. The crash comes from pandas, but I don't know what it actually stems from.
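A minimal sketch of such a check, for illustration only. The helper name `warn_on_duplicate_columns` is hypothetical (it is not a pandas function); it relies on the existing `Index.is_unique` and `Index.duplicated()` APIs.

```python
import warnings

import pandas as pd


def warn_on_duplicate_columns(df):
    """Hypothetical helper: warn about repeated column labels and return them."""
    if df.columns.is_unique:
        return []
    # duplicated() flags every occurrence after the first; unique() collapses repeats.
    dupes = df.columns[df.columns.duplicated()].unique().tolist()
    warnings.warn("duplicate column names: {}".format(dupes))
    return dupes


# A hand-made frame with a repeated label, standing in for real data.
frame = pd.DataFrame([[1, 2, 3]], columns=["a", "b", "a"])
found = warn_on_duplicate_columns(frame)
print(found)
```

Calling something like this right after loading would surface the problem before a downstream library such as seaborn trips over it.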
While this doesn't directly address the issue, for the specific use-case that @njvack is encountering - wanting to warn the user about duplicate columns - I've resorted to a pre-load of the header row with
and then any values with count > 1 can be reported to the user. As a general issue though,
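The snippet from this comment is not preserved here. One way such a header pre-load might look, as a sketch under assumptions: the CSV text is made up, and `header=None` with `nrows=1` is used so that pandas reads the first line as data and leaves the names unmangled.

```python
import io

import pandas as pd

# Made-up CSV with a repeated "a" column, standing in for a real file.
csv_text = "id,a,b,a\n1,2,3,4\n"

# Read only the first line, as data rather than as a header, so nothing is renamed.
# dtype=str keeps numeric-looking header names as text.
header = pd.read_csv(io.StringIO(csv_text), header=None, nrows=1, dtype=str).iloc[0]

# Count each header name and keep those that occur more than once.
counts = header.value_counts()
repeated = counts[counts > 1].to_dict()
print(repeated)
```

The full file can then be read in a second pass with the usual `read_csv()` call, with the repeated names reported to the user beforehand.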
Right now, pandas's read_csv() supports forcing column names read from CSV data to be unique:
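A minimal sketch of that default behavior, using a made-up CSV with a repeated `a` column:

```python
import io

import pandas as pd

# Made-up CSV whose header repeats the name "a".
csv_text = "id,a,b,a\n1,2,3,4\n"

df = pd.read_csv(io.StringIO(csv_text))

# The second "a" is renamed with a numeric suffix so that all names are unique.
print(list(df.columns))  # ['id', 'a', 'b', 'a.1']
```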
The documentation suggests that passing `mangle_dupe_cols=False` to `read_csv()` will change this behavior to one where it'll overwrite data on load. That doesn't seem to be implemented as of this version (pandas 0.20.3, per the version output below).

However! Pandas doesn't fundamentally disallow duplicate column names. There's a simple (but somewhat convoluted) trick to get duplicate columns from a CSV file into your headers: read the file without interpreting the header row, then promote the first row to column names.
(Then again, maybe this isn't so simple; there's a funny "0" in there with the column headings and that seems weird.)
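The original snippet is not preserved here, so the following is only a sketch of the trick as described, on a made-up CSV. If the first row is assigned to `df.columns` as a Series, its name (the row label `0`) becomes the name of the column index, which may be the "0" mentioned above; converting the row to a plain list avoids that.

```python
import io

import pandas as pd

# Made-up CSV whose header repeats the name "a".
csv_text = "id,a,b,a\n1,2,3,4\n5,6,7,8\n"

# header=None makes pandas treat the header line as an ordinary data row.
# dtype=str is used because every column now mixes a header string with data.
raw = pd.read_csv(io.StringIO(csv_text), header=None, dtype=str)

# Drop the first row from the data and promote it to column names.
df = raw.iloc[1:].reset_index(drop=True)
df.columns = raw.iloc[0].tolist()

print(list(df.columns))  # ['id', 'a', 'b', 'a']
```

With duplicate labels in place, `df["a"]` returns a two-column DataFrame rather than a Series, which is worth keeping in mind for any code that runs after the load. The values are also strings at this point and would need converting if numeric types are wanted.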
Problem description
I would like a native way to read CSV files with repeated headers. In my application, it's literally so I can warn people about duplicated column headers. Yes, I could use Python's built-in `csv` module for this, but then I'm using two methods to read CSV files and it gets weird.

Since `mangle_dupe_cols=False` is not yet implemented, I might propose this behavior in this case.

Output of `pd.show_versions()`
INSTALLED VERSIONS
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 17.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.20.3
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.2.7
Cython: None
numpy: 1.13.1
scipy: None
xarray: None
IPython: 5.4.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None