BUG: na_values dict form not working on index column #57547

anna-intellegens · 2024-02-21T12:06:45Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

from io import StringIO

from pandas._libs.parsers import STR_NA_VALUES
import pandas as pd

file_contents = """,x,y
MA,1,2
NA,2,1
OA,,3
"""

default_nan_values = STR_NA_VALUES | {"squid"}
names = [None, "x", "y"]
nan_mapping = {name: default_nan_values for name in names}
dtype = {0: "object", "x": "float32", "y": "float32"}

pd.read_csv(
    StringIO(file_contents),
    index_col=0,
    header=0,
    engine="c",
    dtype=dtype,
    names=names,
    na_values=nan_mapping,
    keep_default_na=False,
)

Issue Description

I'm trying to find a way to read in an index column as exact strings, but read in the rest of the columns as NaN-able numbers or strings. The dict form of na_values seems to be the only way implied in the documentation to allow this. However, when I try it, read_csv fails with the following traceback:

Traceback (most recent call last):
  File ".../test.py", line 17, in <module>
    pd.read_csv(
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1024, in read_csv
    return _read(filepath_or_buffer, kwds)
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 624, in _read
    return parser.read(nrows)
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1921, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 333, in read
    index, column_names = self._make_index(date_data, alldata, names)
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/base_parser.py", line 372, in _make_index
    index = self._agg_index(simple_index)
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/base_parser.py", line 504, in _agg_index
    arr, _ = self._infer_types(
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/base_parser.py", line 744, in _infer_types
    na_count = parsers.sanitize_objects(values, na_values)
TypeError: Argument 'na_values' has incorrect type (expected set, got dict)

This is unhelpful, as the docs imply this should work, and I can't find any other way to turn off NaN detection in the index column without also disabling it in the rest of the table, where NaN detection is a hard requirement.

Expected Behavior

The file should be read without error, producing a DataFrame like the following:

       x    y
MA   1.0  2.0
NA   2.0  1.0
OA   NaN  3.0

Installed Versions

This has been tested on three versions of pandas (v1.5.2, v2.0.2, and v2.2.0), all with similar results.

INSTALLED VERSIONS
------------------
commit : fd3f571
python : 3.10.11.final.0
python-bits : 64
OS : Linux
OS-release : 6.5.0-18-generic
Version : #18~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Feb 7 11:40:03 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 2.2.0
numpy : 1.26.3
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 69.0.3
pip : 23.2.1
Cython : None
pytest : 7.4.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.1
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.3
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : 0.58.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.11.4
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None


techSavvy1001 · 2024-02-22T06:01:18Z

import io
import pandas as pd

file_contents = """
,x,y
MA,1,2
NA,2,1
OA,,3
"""

default_nan_values = set(["NA", "squid"])
names = [None, "x", "y"]
nan_mapping = {name: default_nan_values for name in names}
dtype = {0: "object", "x": "float32", "y": "float32"}

try:
    df = pd.read_csv(
        io.StringIO(file_contents),
        index_col=0,
        header=0,
        engine="c",
        dtype=dtype,
        names=names,
        na_values=nan_mapping,
        keep_default_na=True,
    )
    print(df)
except Exception as e:
    print(f"Error occurred: {e}")

rhshadrach · 2024-02-27T03:33:34Z

Thanks for the report, confirmed on main. Further investigations and PRs to fix are welcome!

tomhoq · 2024-03-05T08:33:05Z

take

asishm · 2024-03-05T18:07:08Z

Replacing the None in names with anything else (a string) works fine.
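For anyone landing here before a fix is released, the workaround above might look like the sketch below. Two things in it are assumptions and not part of the original report: the placeholder index name "idx", and the small explicit set of NA markers used in place of pandas's full default list. Leaving "idx" out of the na_values dict while passing keep_default_na=False is intended to keep the index values as exact strings.

```python
from io import StringIO

import pandas as pd

file_contents = """,x,y
MA,1,2
NA,2,1
OA,,3
"""

# Explicit NA markers for the data columns only. This small set is an
# assumption for the sketch; the original report used pandas's default list.
data_na_values = {"", "NA", "squid"}

# "idx" is a placeholder name standing in for the None used in the report.
names = ["idx", "x", "y"]

df = pd.read_csv(
    StringIO(file_contents),
    index_col=0,
    header=0,
    engine="c",
    dtype={"idx": "object", "x": "float32", "y": "float32"},
    names=names,
    # No entry for "idx": with keep_default_na=False, no NA markers
    # should be applied to the index column.
    na_values={"x": data_na_values, "y": data_na_values},
    keep_default_na=False,
)
print(df)
```

If the index should end up unnamed, as in the expected output of the report, `df.index.name = None` can be set afterwards.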

tomhoq · 2024-03-17T19:03:04Z

@anna-intellegens Sorry to bother you, but in the issue post you mention that

The dict form of na_values seems to be the only way implied in the documentation to allow having no na values on a specific column

In case you might remember, was the documentation this one?

Otherwise, I cannot find where in the docs such a property is mentioned.

Thank you

anna-intellegens · 2024-03-18T13:43:19Z

In case you might remember, was the documentation this one?

Yeah, this was the section I was reading. Many thanks for taking a look at this.

BUG: Na_values dict not working on index column (#57547)
* fix base_parser not setting col_na_values when na_values is a dict containing None
* fix python_parser applying na_values in a column None
* add unit test to test_na_values.py
* update whatsnew
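To illustrate the first item of the commit summary, here is a simplified pure-Python model of per-column NA-value resolution. It is not pandas's code: both function names and the DEFAULT_NA subset are hypothetical, and the "skipping None" variant is only a guess at the failure mode, inferred from the traceback (a dict reaching a function that expects a set).

```python
# Illustrative subset only, not pandas's full default NA list.
DEFAULT_NA = frozenset({"", "NA", "NaN"})


def resolve_na_values(col_name, na_values, keep_default_na=True):
    """Hypothetical helper: pick the set of NA strings for one column."""
    if not isinstance(na_values, dict):
        return set(na_values)
    if col_name in na_values:  # a None key is looked up like any other
        return set(na_values[col_name])
    return set(DEFAULT_NA) if keep_default_na else set()


def resolve_skipping_none(col_name, na_values, keep_default_na=True):
    """Model of the reported failure: a None name bypasses the lookup."""
    if isinstance(na_values, dict) and col_name is not None:
        return resolve_na_values(col_name, na_values, keep_default_na)
    return na_values  # still a dict when col_name is None


mapping = {None: {"squid"}, "x": {"", "squid"}}

# The faulty variant hands the whole dict onwards for an unnamed column.
leaked = resolve_skipping_none(None, mapping, keep_default_na=False)

# Resolving the None key explicitly yields a plain set, as callers expect.
fixed = resolve_na_values(None, mapping, keep_default_na=False)

# A column with no entry gets no NA markers when defaults are disabled.
missing = resolve_na_values("y", mapping, keep_default_na=False)
```

In this model the only difference between the two variants is whether a None column name is allowed to reach the dict lookup.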


anna-intellegens added the Bug and Needs Triage labels Feb 21, 2024

rhshadrach added the IO CSV label and removed the Needs Triage label Feb 27, 2024

github-actions bot assigned tomhoq Mar 5, 2024

tomhoq mentioned this issue Mar 22, 2024, in the merged pull request "BUG: Fix na_values dict not working on index column (#57547)" #57965

mroeschke closed this as completed in #57965 Apr 9, 2024
