BUG: DataFrame.drop_duplicates confuses NULL bytes #34551

Open
marco-neumann-by opened this issue Jun 3, 2020 · 12 comments
Labels
Bug · duplicated (duplicated, drop_duplicates) · hashing (hash_pandas_object) · Strings (String extension data type and string data)

Comments

@marco-neumann-by

marco-neumann-by commented Jun 3, 2020

Code Sample, a copy-pastable example

import pandas as pd
import pandas.testing as pdt

df = pd.DataFrame({"col": ["", "\0"]})
ser = df["col"].copy()

ser_actual = ser.drop_duplicates()
ser_expected = pd.Series(["", "\0"], name="col")
pdt.assert_series_equal(ser_actual, ser_expected)  # passes

df_actual = df.drop_duplicates()
df_expected = pd.DataFrame({"col": ["", "\0"]})
pdt.assert_frame_equal(df_actual, df_expected)  # fails, only a single row left

Problem description

The test fails; in particular, note the inconsistent behavior between Series.drop_duplicates and DataFrame.drop_duplicates.

Expected Output

Test passes.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-33-generic
machine : x86_64
processor :
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.4
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : 5.4.1
hypothesis : None
sphinx : 3.0.3
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : 5.4.1
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

@marco-neumann-by marco-neumann-by added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 3, 2020
@jorisvandenbossche
Member

@marco-neumann-jdas Thanks for the report! That's indeed clearly a bug.

The difference can also be seen in duplicated:

In [88]: df.duplicated()  
Out[88]: 
0    False
1     True
dtype: bool

In [89]: ser.duplicated()   
Out[89]: 
0    False
1    False
Name: col, dtype: bool

Now, under the hood, it seems that for DataFrame this is based on the factorize algorithm, while for Series a custom duplicated algorithm is used:

In [97]: pd.core.algorithms.duplicated(ser)  
Out[97]: array([False, False])

In [98]: pd.core.algorithms.factorize(ser) 
Out[98]: (array([0, 0]), Index([''], dtype='object'))

And here you can see that factorize only sees a single unique value. So I think the bug is inside the factorize code.

And then, for factorize we actually have a string-specialized hash table (while duplicated only uses a general object-dtype specialization). So for factorize, comparing those two hash tables:

In [99]: from pandas._libs.hashtable import StringHashTable, PyObjectHashTable 

In [100]: table = StringHashTable(len(ser))  

In [101]: table.factorize(ser.values)    
Out[101]: (array([''], dtype=object), array([0, 0]))

In [102]: table = PyObjectHashTable(len(ser)) 

In [103]: table.factorize(ser.values)  
Out[103]: (array(['', '\x00'], dtype=object), array([0, 1]))

indeed shows a difference.
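The collapse can be modeled in pure Python: if hash-table keys are effectively truncated at the first NUL byte (what a strlen-based C string table would see), '' and '\0' map to the same key. This is only an illustrative sketch of the suspected behavior, not pandas code:

```python
# Illustrative model (not pandas code): emulate a hash table whose keys
# are truncated at the first NUL byte, as with C strlen-based hashing.
def c_truncate(s: str) -> str:
    # A strlen()-based C routine only sees the bytes before the first '\0'.
    nul = s.find("\0")
    return s if nul == -1 else s[:nul]

values = ["", "\0"]
python_keys = set(values)                        # Python hashing: 2 distinct keys
c_style_keys = {c_truncate(v) for v in values}   # truncated keys collapse to 1

print(len(python_keys), len(c_style_keys))  # 2 1
```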

@jorisvandenbossche jorisvandenbossche added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 4, 2020
@jorisvandenbossche jorisvandenbossche added this to the Contributions Welcome milestone Jun 4, 2020
@jorisvandenbossche
Member

Ah, #32993 seems to be a similar report.

@MJafarMashhadi
Contributor

It comes down to the use of get_c_string:

cdef inline const char* get_c_string(str py_string) except NULL:
    return get_c_string_buf_and_size(py_string, NULL)

which ignores the returned string length.
I replaced that with PyUnicode_AsUTF8AndSize and got the string length as an integer in the _unique method of StringHashTable.
It stores all the values in the vecs array and then passes them one by one to kh_get_str, which treats its input as a C-style null-terminated string.

v = vecs[i]
k = kh_get_str(self.table, v)

The very existence of get_c_string, which ignores the string length, is the source of all the trouble. Given that it is only used in StringHashTable, I reckon it exists because khash's kh_get_str doesn't accept a string length as input.

I couldn't go any further because I got a bit confused by khash; the declaration of kh_get_str is there, but I couldn't find its implementation. I would appreciate your input, @jorisvandenbossche.

Is it an external library, or is it part of pandas? Can we make changes to it? If it is an external library and we cannot change it, what is the best way forward? My guess would be reporting the issue to the khash developers and adding warnings to the docs plus TODO comments to the code.


Quick and dirty debugging with a print:

pandas/_libs/hashtable_class_helper.pxi.in:810

try:
    v = get_c_string_buf_and_size(<str>val, &string_length)
except UnicodeEncodeError:
    v = get_c_string_buf_and_size(<str>repr(val), &string_length)
print('what hasher sees:', v, 'what it actually is:', v[:string_length])

Output:

$ python -c 'import pandas as pd; pd.DataFrame({"col": ["a\0Jason", "a\0Kyle"]}).duplicated()'               
what hasher sees: b'a' what it actually is: b'a\x00Jason'
what hasher sees: b'a' what it actually is: b'a\x00Kyle'
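For context, C's strlen stops at the first NUL byte, which is why a strlen-based key only sees the prefix. A quick check from Python (assuming a standard libc is loadable via ctypes, e.g. on Linux/glibc or macOS; this is a demonstration, not part of pandas):

```python
import ctypes
import ctypes.util

# Load the C standard library (assumes find_library can locate it).
libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.strlen.restype = ctypes.c_size_t

# strlen() stops at the embedded NUL, so both buffers look like b'a'.
print(libc.strlen(b"a\x00Jason"))  # 1
print(libc.strlen(b"a\x00Kyle"))   # 1
```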

@mroeschke mroeschke added duplicated duplicated, drop_duplicates Strings String extension data type and string data and removed Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Aug 7, 2021
@josteinbf

josteinbf commented Oct 14, 2021

I've seen the same problem with groupby:

import pandas as pd

print(pd._version.get_versions())

assert (pd.Series(['\x00']) != pd.Series([''])).all()           # ok
assert len(pd.Series(['\x00', '']).to_frame('col1').groupby('col1')) == 2   # fails

My pandas version:

{'date': '2021-09-12T11:02:49+0100', 'dirty': False, 'error': None, 'full-revisionid': '73c68257545b5f8530b7044f56647bd2db92e2ba', 'version': '1.3.3'}

Also, numpy has an issue which seems very similar, see numpy/numpy#20118; however, in numpy this is considered a consequence of how numpy stores strings, and not something that will be changed.
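Until this is fixed, one possible workaround (a sketch, not an official recommendation) is to escape NUL bytes before grouping, so the C-level hashing never sees an embedded '\0'. The escape scheme below is illustrative and assumes the literal text '\x00' does not already occur in the data:

```python
import pandas as pd

df = pd.Series(["\x00", ""]).to_frame("col1")

# Escape NUL bytes so the keys differ even under C-string (strlen) semantics.
# Assumes the literal four-character text '\x00' does not occur in the data.
escaped = df["col1"].str.replace("\x00", "\\x00", regex=False)

print(df.groupby(escaped).ngroups)  # 2, as expected
```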

@cr-perry

cr-perry commented Jan 16, 2022

I'm experiencing the same issue with the creation of a pd.MultiIndex. Given two distinct input values that are identical up to their NUL character, the index maps them to a single code value, and both then end up being assigned the first string value.

My research trail led me to factorize, StringHashTable, and (the elusive) kh_get_str. I'm not sure how to proceed, but happy to help (and, of course, to add my +1 to the issue).

My reproduction:

>>> a = pd.MultiIndex.from_tuples([('test\x001',), ('test\x002',)])
>>> print(a.levels, a.codes)
[['test1']] [[0, 0]]

note: pandas 1.0.1

@DeanLa
Contributor

DeanLa commented Mar 8, 2022

I'm experiencing something similar with nunique, groupby and value_counts.

s = pd.Series(['foo', 'b\x00a\x00z\x00', 'b\x00az'])
s.nunique() # result: 2
s.value_counts().shape[0] # result: 3
s.to_frame('bar').assign(value=1).groupby('bar').value.count().shape[0] # result: 2

Note that which answer is "correct" is up for debate; what's important here is consistency.

pandas : 1.3.5
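For reference, plain Python set semantics treat all three strings as distinct, so a consistent answer here would arguably be 3:

```python
# Python-level hashing handles embedded NUL bytes correctly.
values = ["foo", "b\x00a\x00z\x00", "b\x00az"]
print(len(set(values)))  # 3
```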

@Maaslak

Maaslak commented Mar 10, 2022

I'm having a similar issue with MultiIndex. I suppose that during deduplication, characters after \x00 are not analyzed; as a result, column names that differ only in characters after the NUL are considered the same.

>>> list(pd.MultiIndex.from_tuples([('dstrings', '\x00\x01'), ('dstrings', '\x00\xff')]))
[('dstrings', '\x00\x01'), ('dstrings', '\x00\x01')]

@MJafarMashhadi
Contributor

MJafarMashhadi commented Mar 10, 2022

Anywhere a string is used in a hash map/table (indices, groupbys, column names (which are also indices), counting uniques, etc.), this will occur, since the hashing logic is implemented in C, where strings are null-terminated. Refer to my previous comment: #34551 (comment)
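The point can be checked at the Python level: str equality and dict hashing handle embedded NUL bytes correctly, so the collapse only happens once the keys reach the C hash-table layer. An illustrative check:

```python
# Python's own string hashing distinguishes the two values...
a, b = "a\x00Jason", "a\x00Kyle"
d = {a: 1, b: 2}
print(a == b, len(d))  # False 2

# ...while a strlen-style view of the same bytes does not: both look like b'a'.
print(a.encode().split(b"\x00")[0] == b.encode().split(b"\x00")[0])  # True
```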

@simonjayhawkins
Member

note the inconsistent behavior between Series.drop_duplicates and DataFrame.drop_duplicates

Note that since #45534 the behavior is consistent between a Series and a single-column DataFrame, but the issue still persists with a multi-column DataFrame. cc @phofl

@phofl
Member

phofl commented Jun 13, 2022

This is because I aligned the implementations of the two cases with each other. Multi-column drop_duplicates still runs through the previous code paths.

@simonjayhawkins
Member

And then, for factorize we actually have a string-specialized hash table (while duplicated only has a general object dtype specialization that is used). So for factorize, comparing those two hashtables:

There is an open issue to use StringHashTable for value_counts / duplicated with strings (#14860), which should address this inconsistency.

@phofl
Member

phofl commented Jun 13, 2022

We are using factorize under the hood for duplicated, which already uses StringHashTable.
