BUG: na_values dict form not working on index column #57547

Closed
2 of 3 tasks
anna-intellegens opened this issue Feb 21, 2024 · 6 comments · Fixed by #57965
Labels: Bug, IO CSV (read_csv, to_csv)

Comments

@anna-intellegens
anna-intellegens commented Feb 21, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

from io import StringIO

from pandas._libs.parsers import STR_NA_VALUES
import pandas as pd

file_contents = """,x,y
MA,1,2
NA,2,1
OA,,3
"""

default_nan_values = STR_NA_VALUES | {"squid"}
names = [None, "x", "y"]
nan_mapping = {name: default_nan_values for name in names}
dtype = {0: "object", "x": "float32", "y": "float32"}

pd.read_csv(
    StringIO(file_contents),
    index_col=0,
    header=0,
    engine="c",
    dtype=dtype,
    names=names,
    na_values=nan_mapping,
    keep_default_na=False,
)

Issue Description

I'm trying to read an index column as exact strings while reading the rest of the columns as NaN-able numbers or strings. The dict form of na_values seems to be the only way implied in the documentation to allow this. However, when I try it, read_csv raises the following error:

Traceback (most recent call last):
  File ".../test.py", line 17, in <module>
    pd.read_csv(
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1024, in read_csv
    return _read(filepath_or_buffer, kwds)
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 624, in _read
    return parser.read(nrows)
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1921, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 333, in read
    index, column_names = self._make_index(date_data, alldata, names)
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/base_parser.py", line 372, in _make_index
    index = self._agg_index(simple_index)
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/base_parser.py", line 504, in _agg_index
    arr, _ = self._infer_types(
  File ".../venv/lib/python3.10/site-packages/pandas/io/parsers/base_parser.py", line 744, in _infer_types
    na_count = parsers.sanitize_objects(values, na_values)
TypeError: Argument 'na_values' has incorrect type (expected set, got dict)

This is unhelpful, as the docs imply this should work, and I can't find any other way to turn off NaN detection in the index column without also disabling it in the rest of the table, where it is a hard requirement.

Expected Behavior

The file should be read without error, producing a DataFrame like the following:

      x    y
MA  1.0  2.0
NA  2.0  1.0
OA  NaN  3.0

Installed Versions

This has been tested on three versions of pandas (v1.5.2, v2.0.2, and v2.2.0), all with similar results.

INSTALLED VERSIONS
------------------
commit : fd3f571
python : 3.10.11.final.0
python-bits : 64
OS : Linux
OS-release : 6.5.0-18-generic
Version : #18~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Feb 7 11:40:03 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 2.2.0
numpy : 1.26.3
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 69.0.3
pip : 23.2.1
Cython : None
pytest : 7.4.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.1
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.3
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : 0.58.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.11.4
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

anna-intellegens added the Bug and Needs Triage labels Feb 21, 2024
@techSavvy1001
import io
import pandas as pd

file_contents = """
,x,y
MA,1,2
NA,2,1
OA,,3
"""

default_nan_values = set(["NA", "squid"])
names = [None, "x", "y"]
nan_mapping = {name: default_nan_values for name in names}
dtype = {0: "object", "x": "float32", "y": "float32"}

try:
    df = pd.read_csv(
        io.StringIO(file_contents),
        index_col=0,
        header=0,
        engine="c",
        dtype=dtype,
        names=names,
        na_values=nan_mapping,
        keep_default_na=True,
    )
    print(df)
except Exception as e:
    print(f"Error occurred: {e}")

@rhshadrach
Member

Thanks for the report, confirmed on main. Further investigations and PRs to fix are welcome!

rhshadrach added the IO CSV label and removed the Needs Triage label Feb 27, 2024
@tomhoq
Contributor

tomhoq commented Mar 5, 2024

take

@asishm
Contributor

asishm commented Mar 5, 2024

Replacing the None in names with anything else (a string) works fine.
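A minimal sketch of that workaround, assuming the goal from the original report (keep the index values as exact strings, apply NA detection only to the data columns). The placeholder name "idx" and the reduced NA set are illustrative assumptions, not part of the report, which used STR_NA_VALUES | {"squid"}:

```python
from io import StringIO

import pandas as pd

file_contents = """,x,y
MA,1,2
NA,2,1
OA,,3
"""

# "idx" is a hypothetical placeholder name used instead of None.
names = ["idx", "x", "y"]
# Illustrative NA strings; the original report used STR_NA_VALUES | {"squid"}.
na_strings = {"", "NA", "squid"}
# Map only the data columns, so no NA strings are registered for the index column.
nan_mapping = {name: na_strings for name in ("x", "y")}

df = pd.read_csv(
    StringIO(file_contents),
    index_col=0,
    header=0,
    engine="c",
    dtype={"x": "float32", "y": "float32"},
    names=names,
    na_values=nan_mapping,
    keep_default_na=False,
)
df.index.name = None  # drop the placeholder name again
print(df)
```

Because keep_default_na=False and the index column has no entry in the dict, the literal string "NA" in the index should be kept as-is, while the empty field in column x is read as NaN.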

@tomhoq
Contributor

tomhoq commented Mar 17, 2024

@thomas-intellegens Sorry to bother, but in the issue post you mention that

The dict form of na_values seems to be the only way implied in the documentation to allow having no na values on a specific column

In case you might remember, was the documentation this one?

Otherwise, I cannot find where in the docs such a property is mentioned.

Thank you

@anna-intellegens
Author

In case you might remember, was the documentation this one?

Yeah, this was the section I was reading. Many thanks for taking a look at this.

mroeschke pushed a commit that referenced this issue Apr 9, 2024
BUG: Na_values dict not working on index column (#57547)

* fix base_parser not setting col_na_values when na_values is a dict containing None

* fix python_parser applying na_values in a column None

* add unit test to test_na_values.py;

* update whatsnew.
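
The commit message describes the fix only in words. As a rough illustration of the idea, and not the actual pandas source, the sketch below resolves the NA strings for a single column from a dict-form na_values, treating None as a valid column key so that the result is always a set. The function name resolve_col_na_values and the DEFAULT_NA constant are hypothetical:

```python
# Hypothetical stand-in for pandas' default NA strings (not the real list).
DEFAULT_NA = frozenset({"", "NA", "NaN"})


def resolve_col_na_values(col_name, na_values, keep_default_na):
    """Return the set of NA strings for one column.

    col_name may be None (an unnamed index column); it is looked up in the
    dict like any other key, so a set is always returned, never the dict.
    """
    if col_name in na_values:
        col_values = set(na_values[col_name])
        # Assumption: per-column values extend the defaults when they are kept.
        return col_values | DEFAULT_NA if keep_default_na else col_values
    # No entry for this column: fall back to the defaults or to nothing.
    return set(DEFAULT_NA) if keep_default_na else set()


# Usage: an unnamed (None) index column with its own NA strings.
mapping = {None: {"squid"}, "x": {"", "NA"}}
print(resolve_col_na_values(None, mapping, keep_default_na=False))
print(resolve_col_na_values("y", mapping, keep_default_na=False))
```

The traceback in the report arises because the whole dict, rather than a per-column set like the one returned here, reached sanitize_objects.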
pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this issue May 7, 2024