BUG: DataFrame.drop_duplicates confuses NULL bytes #34551

Open
marco-neumann-by opened this issue Jun 3, 2020 · 12 comments
Labels
Bug · duplicated (duplicated, drop_duplicates) · hashing (hash_pandas_object) · Strings (String extension data type and string data)

Comments

@marco-neumann-by

marco-neumann-by commented Jun 3, 2020

Code Sample, a copy-pastable example

import pandas as pd
import pandas.testing as pdt

df = pd.DataFrame({"col": ["", "\0"]})
ser = df["col"].copy()

ser_actual = ser.drop_duplicates()
ser_expected = pd.Series(["", "\0"], name="col")
pdt.assert_series_equal(ser_actual, ser_expected)  # passes

df_actual = df.drop_duplicates()
df_expected = pd.DataFrame({"col": ["", "\0"]})
pdt.assert_frame_equal(df_actual, df_expected)  # fails, only a single row left

Problem description

The test fails; in particular, note the inconsistent behavior between Series.drop_duplicates and DataFrame.drop_duplicates.

Expected Output

Test passes.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-33-generic
machine : x86_64
processor :
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.4
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : 5.4.1
hypothesis : None
sphinx : 3.0.3
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : 5.4.1
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

@marco-neumann-by marco-neumann-by added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 3, 2020
@jorisvandenbossche
Member

@marco-neumann-jdas Thanks for the report! That's indeed clearly a bug.

The difference can also be seen in duplicated:

In [88]: df.duplicated()  
Out[88]: 
0    False
1     True
dtype: bool

In [89]: ser.duplicated()   
Out[89]: 
0    False
1    False
Name: col, dtype: bool

Now, under the hood, it seems that for DataFrame this is based on the factorize algorithm, while for Series a custom duplicated algorithm is used:

In [97]: pd.core.algorithms.duplicated(ser)  
Out[97]: array([False, False])

In [98]: pd.core.algorithms.factorize(ser) 
Out[98]: (array([0, 0]), Index([''], dtype='object'))

And here you can see that factorize only sees a single unique value. So I think the bug is inside the factorize code.

And then, for factorize we actually have a string-specialized hash table (while duplicated only uses a general object-dtype specialization). So for factorize, comparing those two hash tables:

In [99]: from pandas._libs.hashtable import StringHashTable, PyObjectHashTable 

In [100]: table = StringHashTable(len(ser))  

In [101]: table.factorize(ser.values)    
Out[101]: (array([''], dtype=object), array([0, 0]))

In [102]: table = PyObjectHashTable(len(ser)) 

In [103]: table.factorize(ser.values)  
Out[103]: (array(['', '\x00'], dtype=object), array([0, 1]))

indeed shows a difference.
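The collapse can be modeled in pure Python: if hash-table keys are effectively truncated at the first NUL byte (what a strlen-based C string table would see), '' and '\0' map to the same key. This is only an illustrative sketch of the suspected behavior, not pandas code:

```python
# Illustrative model (not pandas code): emulate a hash table whose keys
# are truncated at the first NUL byte, as with C strlen-based hashing.
def c_truncate(s: str) -> str:
    # A strlen()-based C routine only sees the bytes before the first '\0'.
    nul = s.find("\0")
    return s if nul == -1 else s[:nul]

values = ["", "\0"]
python_keys = set(values)                        # Python hashing: 2 distinct keys
c_style_keys = {c_truncate(v) for v in values}   # truncated keys collapse to 1

print(len(python_keys), len(c_style_keys))  # 2 1
```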

@jorisvandenbossche jorisvandenbossche added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 4, 2020
@jorisvandenbossche jorisvandenbossche added this to the Contributions Welcome milestone Jun 4, 2020
@jorisvandenbossche
Member

Ah, #32993 seems to be a similar report.

@MJafarMashhadi
Contributor

It comes down to the use of get_c_string:

cdef inline const char* get_c_string(str py_string) except NULL:
    return get_c_string_buf_and_size(py_string, NULL)

which ignores the returned string length.
I replaced that with PyUnicode_AsUTF8AndSize and got the string length as an integer in the _unique method of StringHashTable.
It stores all the values in the vecs array and then passes them one by one to kh_get_str, which treats its input as a C-style null-terminated string.

v = vecs[i]
k = kh_get_str(self.table, v)

The very existence of get_c_string, which ignores the string length, is the source of all the trouble. Given that it is only used in StringHashTable, I reckon it exists because khash's kh_get_str doesn't accept a string length as input.

I couldn't go any further because I got a bit confused by khash; the declaration of kh_get_str is there, but I couldn't find its implementation. I would appreciate your input, @jorisvandenbossche.

Is it an external library, or is it part of pandas? Can we make changes to it? If it is an external library and we cannot change it, what is the best way forward? My guess would be reporting the issue to the khash developers and adding warnings to the docs plus TODO comments to the code.


Quick and dirty debugging with a print:

pandas/_libs/hashtable_class_helper.pxi.in:810

try:
    v = get_c_string_buf_and_size(<str>val, &string_length)
except UnicodeEncodeError:
    v = get_c_string_buf_and_size(<str>repr(val), &string_length)
print('what hasher sees:', v, 'what it actually is:', v[:string_length])

Output:

$ python -c 'import pandas as pd; pd.DataFrame({"col": ["a\0Jason", "a\0Kyle"]}).duplicated()'               
what hasher sees: b'a' what it actually is: b'a\x00Jason'
what hasher sees: b'a' what it actually is: b'a\x00Kyle'
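For context, C's strlen stops at the first NUL byte, which is why a strlen-based key only sees the prefix. A quick check from Python (assuming a standard libc is loadable via ctypes, e.g. on Linux/glibc or macOS; this is a demonstration, not part of pandas):

```python
import ctypes
import ctypes.util

# Load the C standard library (assumes find_library can locate it).
libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.strlen.restype = ctypes.c_size_t

# strlen() stops at the embedded NUL, so both buffers look like b'a'.
print(libc.strlen(b"a\x00Jason"))  # 1
print(libc.strlen(b"a\x00Kyle"))   # 1
```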

@mroeschke mroeschke added duplicated duplicated, drop_duplicates Strings String extension data type and string data and removed Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Aug 7, 2021
@josteinbf

josteinbf commented Oct 14, 2021

I've seen the same problem with groupby:

import pandas as pd

print(pd._version.get_versions())

assert (pd.Series(['\x00']) != pd.Series([''])).all()           # ok
assert len(pd.Series(['\x00', '']).to_frame('col1').groupby('col1')) == 2   # fails

My pandas version:

{'date': '2021-09-12T11:02:49+0100', 'dirty': False, 'error': None, 'full-revisionid': '73c68257545b5f8530b7044f56647bd2db92e2ba', 'version': '1.3.3'}

Also, numpy has an issue which seems very similar, see numpy/numpy#20118; however, in numpy this is considered a consequence of how numpy stores strings, and not something that will be changed.
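Until this is fixed, one possible workaround (a sketch, not an official recommendation) is to escape NUL bytes before grouping, so the C-level hashing never sees an embedded '\0'. The escape scheme below is illustrative and assumes the literal text '\x00' does not already occur in the data:

```python
import pandas as pd

df = pd.Series(["\x00", ""]).to_frame("col1")

# Escape NUL bytes so the keys differ even under C-string (strlen) semantics.
# Assumes the literal four-character text '\x00' does not occur in the data.
escaped = df["col1"].str.replace("\x00", "\\x00", regex=False)

print(df.groupby(escaped).ngroups)  # 2, as expected
```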

@cr-perry

cr-perry commented Jan 16, 2022

I'm experiencing the same issue with the creation of a pd.MultiIndex. Given two distinct input values that are identical up to their NUL character, the index maps them to a single code value, and both then end up being assigned the first string value.

My research trail led me to factorize, StringHashTable, and (the elusive) kh_get_str. I'm not sure how to proceed, but happy to help (and, of course, to add my +1 to the issue).

My reproduction:

>>> a = pd.MultiIndex.from_tuples([('test\x001',), ('test\x002',)])
>>> print(a.levels, a.codes)
[['test1']] [[0, 0]]

note: pandas 1.0.1

@DeanLa
Contributor

DeanLa commented Mar 8, 2022

I'm experiencing something similar with nunique, groupby and value_counts.

s = pd.Series(['foo', 'b\x00a\x00z\x00', 'b\x00az'])
s.nunique() # result: 2
s.value_counts().shape[0] # result: 3
s.to_frame('bar').assign(value=1).groupby('bar').value.count().shape[0] # result: 2

Note that which answer is "correct" is up for debate; what's important here is consistency.

pandas : 1.3.5
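For reference, plain Python set semantics treat all three strings as distinct, so a consistent answer here would arguably be 3:

```python
# Python-level hashing handles embedded NUL bytes correctly.
values = ["foo", "b\x00a\x00z\x00", "b\x00az"]
print(len(set(values)))  # 3
```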

@Maaslak

Maaslak commented Mar 10, 2022

I'm having a similar issue with MultiIndex. I suppose that during deduplication, characters after \x00 are not analyzed; as a result, column names that differ only in characters after the NUL are considered the same.

>>> list(pd.MultiIndex.from_tuples([('dstrings', '\x00\x01'), ('dstrings', '\x00\xff')]))
[('dstrings', '\x00\x01'), ('dstrings', '\x00\x01')]

@MJafarMashhadi
Contributor

MJafarMashhadi commented Mar 10, 2022

Anywhere a string is used in a hash map/table (indices, groupbys, column names (which are also indices), counting uniques, etc.), this will occur, since the hashing logic is implemented in C, where strings are null-terminated. Refer to my previous comment: #34551 (comment)
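The point can be checked at the Python level: str equality and dict hashing handle embedded NUL bytes correctly, so the collapse only happens once the keys reach the C hash-table layer. An illustrative check:

```python
# Python's own string hashing distinguishes the two values...
a, b = "a\x00Jason", "a\x00Kyle"
d = {a: 1, b: 2}
print(a == b, len(d))  # False 2

# ...while a strlen-style view of the same bytes does not: both look like b'a'.
print(a.encode().split(b"\x00")[0] == b.encode().split(b"\x00")[0])  # True
```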

@simonjayhawkins
Member

note the inconsistent behavior between Series.drop_duplicates and DataFrame.drop_duplicates

Note that since #45534 the behavior is consistent between a Series and a single-column DataFrame, but the issue still persists with a multi-column DataFrame. cc @phofl

@phofl
Member

phofl commented Jun 13, 2022

This is because I aligned the implementations of the two cases with each other. Multi-column drop_duplicates still runs through the previous code paths.

@simonjayhawkins
Member

And then, for factorize we actually have a string-specialized hash table (while duplicated only has a general object dtype specialization that is used). So for factorize, comparing those two hashtables:

There is an open issue to use StringHashTable for value_counts / duplicated with strings (#14860), which should address this inconsistency.

@phofl
Member

phofl commented Jun 13, 2022

We are using factorize under the hood for duplicated, which already uses StringHashTable.
