
BUG: pd.Categorical turns all values into NaN #43334


Closed
2 of 3 tasks
mfstabile opened this issue Aug 31, 2021 · 7 comments · Fixed by #43597
Labels
Bug · Categorical (Categorical Data Type) · Internals (Related to non-user accessible pandas implementation) · Regression (Functionality that used to work in a prior pandas version)

Comments

@mfstabile

mfstabile commented Aug 31, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample

import pandas as pd
data = pd.read_excel('titanic.xlsx')

data['Survived'] = data['Survived'].astype('category')
data['Sex'] = data['Sex'].astype('category')

data.Survived.cat.categories = ['No', 'Yes']
data.Sex.cat.categories = ['female','male']

data.Survived = pd.Categorical(data.Survived, categories=['No', 'Yes'], ordered=False)
data.Sex = pd.Categorical(data.Sex, categories=['female','male'], ordered=False)

print(data.head(3))

Problem description

This code sample reads the popular Kaggle Titanic file. Reading titanic.xlsx produces the following DataFrame:

   PassengerId  Survived  Pclass  ...     Fare Cabin Embarked
0            1         0       3  ...   7.2500   NaN        S
1            2         1       1  ...  71.2833   C85        C
2            3         1       3  ...   7.9250   NaN        S

When I execute the code above, the result displayed on the terminal is as follows:

   PassengerId Survived  Pclass  ...     Fare Cabin Embarked
0            1      NaN       3  ...   7.2500   NaN        S
1            2      NaN       1  ...  71.2833   C85        C
2            3      NaN       3  ...   7.9250   NaN        S

As can be seen, all values in the "Survived" Series are now NaN. The expected behavior, however, would be for the values to become "Yes" or "No". Strangely, if I swap the two pd.Categorical assignments (the penultimate and antepenultimate lines), which gives the following code sample:

import pandas as pd
data = pd.read_excel('titanic.xlsx')

data['Survived'] = data['Survived'].astype('category')
data['Sex'] = data['Sex'].astype('category')

data.Survived.cat.categories = ['No', 'Yes']
data.Sex.cat.categories = ['female','male']

data.Sex = pd.Categorical(data.Sex, categories=['female','male'], ordered=False)
data.Survived = pd.Categorical(data.Survived, categories=['No', 'Yes'], ordered=False)

print(data.head(3))

The result generated by the code sample above is the expected one, as shown below.

   PassengerId Survived  Pclass  ...     Fare Cabin Embarked
0            1       No       3  ...   7.2500   NaN        S
1            2      Yes       1  ...  71.2833   C85        C
2            3      Yes       3  ...   7.9250   NaN        S

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 5f648bf
python : 3.9.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.6.0
Version : Darwin Kernel Version 19.6.0: Tue Jun 22 19:49:55 PDT 2021; root:xnu-6153.141.35~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : pt_BR.UTF-8
LOCALE : pt_BR.UTF-8

pandas : 1.3.2
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.3
setuptools : 57.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.4.15
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : None
numba : None

@mfstabile added the Bug and Needs Triage labels on Aug 31, 2021
@phofl
Member

phofl commented Sep 1, 2021

Hi, could you please post a minimal and reproducible example? See https://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

@simonjayhawkins added the Needs Info label on Sep 1, 2021
@mfstabile
Author

Hello. I hope the following code sample is helpful.

import pandas as pd

data = pd.DataFrame({'Survived': [1, 0, 1],
                   'Sex': [0, 1, 1]
                   })

data['Survived'] = data['Survived'].astype('category')
data['Sex'] = data['Sex'].astype('category')

data.Survived.cat.categories = ['No', 'Yes']
data.Sex.cat.categories = ['female','male']

data.Survived = pd.Categorical(data.Survived, categories=['No', 'Yes'], ordered=False)
data.Sex = pd.Categorical(data.Sex, categories=['female','male'], ordered=False)

print(data)

The code above produces the following output:

  Survived     Sex
0      NaN  female
1      NaN    male
2      NaN    male

If I invert the pd.Categorical lines:

import pandas as pd

data = pd.DataFrame({'Survived': [1, 0, 1],
                   'Sex': [0, 1, 1]
                   })

data['Survived'] = data['Survived'].astype('category')
data['Sex'] = data['Sex'].astype('category')

data.Survived.cat.categories = ['No', 'Yes']
data.Sex.cat.categories = ['female','male']

data.Sex = pd.Categorical(data.Sex, categories=['female','male'], ordered=False)
data.Survived = pd.Categorical(data.Survived, categories=['No', 'Yes'], ordered=False)

print(data)

The code above produces the following output:

  Survived  Sex
0      Yes  NaN
1       No  NaN
2      Yes  NaN

Changing the Sex Series from int to string:

import pandas as pd

data = pd.DataFrame({'Survived': [1, 0, 1],
                   'Sex': ['female', 'male', 'male']
                   })

data['Survived'] = data['Survived'].astype('category')
data['Sex'] = data['Sex'].astype('category')

data.Survived.cat.categories = ['No', 'Yes']
data.Sex.cat.categories = ['female','male']

data.Sex = pd.Categorical(data.Sex, categories=['female','male'], ordered=False)
data.Survived = pd.Categorical(data.Survived, categories=['No', 'Yes'], ordered=False)

print(data)

The code above produces the expected output:

  Survived     Sex
0      Yes  female
1       No    male
2      Yes    male
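
For readers who only need the relabelled columns, the following is a minimal sketch of one way to get them without assigning to `.cat.categories` and then re-wrapping the column in `pd.Categorical`. It assumes the integer codes mean 0 = "No"/"female" and 1 = "Yes"/"male", as the examples above imply, and it uses `Series.cat.rename_categories` with a mapping. It is an alternative formulation, not the fix discussed in this issue.

```python
import pandas as pd

# Same toy frame as in the report above.
data = pd.DataFrame({"Survived": [1, 0, 1], "Sex": [0, 1, 1]})

# Convert to category, then rename the categories through an explicit mapping.
# rename_categories returns a new Series, so nothing is mutated in place.
survived = data["Survived"].astype("category").cat.rename_categories({0: "No", 1: "Yes"})
sex = data["Sex"].astype("category").cat.rename_categories({0: "female", 1: "male"})

# Build a new frame with both relabelled columns.
data = data.assign(Survived=survived, Sex=sex)
print(data)
```

Because each column is relabelled through an explicit old-to-new mapping, the result does not depend on the order of the two assignments.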

@thisisamardeep

Hi All,

I am picking up this task and will post an update soon, around the end of September.

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue on Sep 4, 2021
@simonjayhawkins removed the Needs Info label on Sep 4, 2021
@simonjayhawkins added this to the 1.3.3 milestone on Sep 4, 2021
@simonjayhawkins added the Regression, Categorical, and Internals labels and removed the Needs Triage label on Sep 4, 2021
@simonjayhawkins
Member

simonjayhawkins commented Sep 4, 2021

code sample worked in 1.2.5

first bad commit: [c68b605] PERF: cache_readonly for Block properties (#40620)

cc @jbrockmendel

xref #43232 for similar issue

@simonjayhawkins
Member

fix suggested in #43232 (comment) also fixes this issue

diff --git a/pandas/core/internals/blocks.py b/pandas/core/internals/blocks.py
index e3fcff1557..ee38b4ce56 100644
--- a/pandas/core/internals/blocks.py
+++ b/pandas/core/internals/blocks.py
@@ -331,7 +331,6 @@ class Block(PandasObject):
     def shape(self) -> Shape:
         return self.values.shape
 
-    @final
     @cache_readonly
     def dtype(self) -> DtypeObj:
         return self.values.dtype
@@ -1867,6 +1866,10 @@ class CategoricalBlock(ExtensionBlock):
     # this Block type is kept for backwards-compatibility
     __slots__ = ()
 
+    @property
+    def dtype(self) -> DtypeObj:
+        return self.values.dtype
+
 
 # -----------------------------------------------------------------
 # Constructor Helpers
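
One plausible reading of the diff above is that a cached `dtype` can go stale when the underlying values are changed in place, whereas a plain property always re-reads it. The sketch below illustrates only that general caching behavior, using simplified stand-in classes (`Values`, `CachedBlock`, and `UncachedBlock` are hypothetical names) and the standard library's `functools.cached_property` in place of pandas' `cache_readonly`. It is not the pandas implementation.

```python
from functools import cached_property


class Values:
    """Hypothetical stand-in for an array whose dtype can change in place."""

    def __init__(self, dtype):
        self.dtype = dtype


class CachedBlock:
    """Caches dtype on first access, like a cache_readonly-style property."""

    def __init__(self, values):
        self.values = values

    @cached_property
    def dtype(self):
        return self.values.dtype


class UncachedBlock:
    """Re-reads dtype from the values on every access."""

    def __init__(self, values):
        self.values = values

    @property
    def dtype(self):
        return self.values.dtype


values = Values("category[0, 1]")
cached = CachedBlock(values)
uncached = UncachedBlock(values)

# First access populates the cache on CachedBlock.
_ = cached.dtype
_ = uncached.dtype

# Simulate an in-place change of the categories on the shared values.
values.dtype = "category['No', 'Yes']"

print(cached.dtype)    # still the value captured at first access
print(uncached.dtype)  # reflects the in-place change
```

In this toy model, the cached block keeps reporting the old dtype after the change, which is the kind of mismatch an uncached `dtype` property avoids.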

@Ljupco7

Ljupco7 commented Jul 7, 2023

Can anyone confirm this bug has been fixed?
I am introducing category as a dtype to a few of my dataframe columns to improve performance.
I have the following code and I keep getting nan values for actual strings in a dataframe column.
This is the way I set a column dtype in a dataframe to be of category type (I did not find another way to do this. If there is a better solution please let me know).

import pandas as pd

acc_domain_values = ['ACTIVE, INACTIVE', 'BLOCKED', 'UNKNOWN']
acc_code_Dtype = pd.api.types.CategoricalDtype(categories=acc_domain_values, ordered=False)
df = pd.read_csv(
    file_name,
    engine='c',
    delimiter=';',
    on_bad_lines='warn',
    low_memory=True,
    dtype={'ACC_CODE': acc_code_Dtype},
)

My purpose in the following code is to get every value in that column that does not belong to acc_domain_values.

acc_code_incorrect_val_list = []
acc_code_incorrect_val_list = df.query('ACC_CODE not in @acc_domain_values')

What is weird is that I get the following results in the console when I print the unique values of the ACC_CODE column like this:

vari = df['ACC_CODE'].unique()

for element in vari :
    print(element)

Results:

nan
BLOCKED
UNKNOWN

It seems as though pandas looks at these two from the list above acc_domain_values, as NaN (nan) values:
'ACTIVE, INACTIVE'
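
Note that, as posted, `'ACTIVE, INACTIVE'` is a single string, so the category list has three entries rather than four. A possible explanation for the NaN values, sketched below under that assumption, is that `read_csv` treats values outside the declared categories as missing. The CSV content and the `read_codes` helper are hypothetical, made up for illustration, since the original file is not shown.

```python
import io

import pandas as pd

# Hypothetical sample data; the commenter's real file is not available.
csv_text = "ACC_CODE\nACTIVE\nINACTIVE\nBLOCKED\nUNKNOWN\n"


def read_codes(categories):
    """Parse the sample CSV with ACC_CODE constrained to the given categories."""
    dtype = pd.api.types.CategoricalDtype(categories=categories, ordered=False)
    return pd.read_csv(io.StringIO(csv_text), dtype={"ACC_CODE": dtype})["ACC_CODE"]


# Category list exactly as posted: 'ACTIVE, INACTIVE' is one string.
as_posted = read_codes(["ACTIVE, INACTIVE", "BLOCKED", "UNKNOWN"])

# Category list with ACTIVE and INACTIVE as separate strings.
separated = read_codes(["ACTIVE", "INACTIVE", "BLOCKED", "UNKNOWN"])

print(as_posted.isna().sum())  # rows whose value is not a declared category
print(separated.isna().sum())
```

If the goal is to list values that fall outside the allowed domain, parsing with a fixed category list discards exactly those values, so reading the column as plain strings first and checking membership afterwards may be the safer order.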

@jbrockmendel
Member

The bug in the OP should be fixed, yes. If you think the problem persists, please open a new issue with a reproducible example (i.e. without read_csv)
