BUG: to_parquet failing for integer-like string values in categorical column #46863
Comments
Again, I'm more than happy to provide a reproducible example. I actually noticed the strange behaviour with If someone more familiar with arrow/pyarrow can suggest a useful example I can test and provide it here. |
Thanks for the report - can you try reducing your data down to a simple example that still demonstrates the issue? For example, you could try on just the 1st half and then the 2nd half and see which of these fails - hopefully one will. Then continue in this manner until it is reasonable to post. |
@rhshadrach I've tried your suggestion but the error doesn't consistently keep appearing if I keep slicing the series:
full_s = df_full["Ticket ID"].cat.remove_unused_categories()
pyarrow.array(full_s)
# ArrowInvalid: Could not convert <pyarrow.StringScalar: '50015202101011004687846'> with type pyarrow.lib.StringScalar: did not recognize Python value type when inferring an Arrow data type
h1_s = full_s.iloc[:len(full_s)//2].cat.remove_unused_categories()
pyarrow.array(h1_s)
# ArrowInvalid: Could not convert <pyarrow.StringScalar: '50015202101011004687846'> with type pyarrow.lib.StringScalar: did not recognize Python value type when inferring an Arrow data type
q1_s = h1_s.iloc[:len(h1_s)//2].cat.remove_unused_categories()
pyarrow.array(q1_s)
# works and creates a pyarrow.lib.DictionaryArray
q2_s = h1_s.iloc[len(h1_s)//2:].cat.remove_unused_categories()
pyarrow.array(q2_s)
# works and creates a pyarrow.lib.DictionaryArray btw |
I've tested the above code on a string typed series |
@RabeezRiaz - Yes, I understand that in this situation it may be difficult to come up with a reproducible example. However, if we are not able to reproduce the error, there isn't a whole lot that can be done. I would recommend trying different slices (other than just first half / second half). |
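The slicing strategy discussed above (halves first, then other slices) can be automated. The following is only a minimal sketch, not something posted in the thread: `shrink_failing_input` and `fails` are hypothetical names, and `fails` is assumed to be a caller-supplied function that rebuilds the column from the given rows, attempts the conversion, and returns True when the error still occurs. The helper greedily drops chunks of decreasing size for as long as the failure persists, so it also covers non-contiguous subsets that a plain first-half / second-half split would miss.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def shrink_failing_input(
    items: Sequence[T], fails: Callable[[List[T]], bool]
) -> List[T]:
    """Greedily drop chunks of `items` while `fails` keeps returning True.

    `fails` is a hypothetical, caller-supplied predicate: it should return
    True when the given subset still triggers the error being investigated.
    """
    current = list(items)
    if not fails(current):
        raise ValueError("the full input does not fail; nothing to shrink")

    chunk = len(current) // 2
    while chunk >= 1:
        start = 0
        while start < len(current):
            # Try the input with one chunk removed.
            candidate = current[:start] + current[start + chunk:]
            if candidate and fails(candidate):
                current = candidate  # the dropped chunk was not needed
            else:
                start += chunk  # this chunk is needed; move to the next one
        chunk //= 2
    return current


# Toy predicate standing in for "the conversion raises": the failure
# needs both 3 and 7 to be present.
print(shrink_failing_input(list(range(10)), lambda xs: 3 in xs and 7 in xs))
# → [3, 7]
```

For the case in this thread, `fails` would presumably wrap the `pyarrow.array(...)` call on the re-sliced series (after `cat.remove_unused_categories()`) in a try/except. The result is small but not guaranteed to be the smallest possible failing input.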
should probably close as can't reproduce. |
Agreed @simonjayhawkins; @RabeezRiaz if you can come up with a reproducible example, post it here and we will be happy to reopen. |
@rhshadrach I've encountered the same issue. Here's a reproducible example:
An attempt to save either
Please note that the above example works if column |
Thanks @pwsiegel - reopened. It appears the error is only encountered when using Categorical where the underlying data is Not sure if this is a pandas or pyarrow issue; cc @jorisvandenbossche @datapythonista |
Is there any update on this @rhshadrach? I am running into the same issue now. |
Running into same problem @rhshadrach |
For the people who run into this failure, which version of pyarrow are you using? Can you also try with the latest version? (13.0) |
Yes, fails with 13.0 |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
# Happy to provide a reproducible example if someone can guide on *what* the example should be since the issue only seems to happen on my full series of raw data
Issue Description
I receive the following exception when trying df.to_parquet():
This is the problematic column:

Same exception persists if I manually try to convert that whole column using pa.array():
Trying the same with a length 1 series for the value in the exception results in exactly the same exception, but the conversion passes successfully if I remove unused categories through pandas beforehand:
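A possible reason the length-1 series behaves like the full column is that a slice of a categorical Series keeps the parent's complete category list. The sketch below uses made-up stand-in values (the real "Ticket ID" data is not shown in this report) and only pandas. It does not reproduce the exception. It only illustrates what `cat.remove_unused_categories()` changes, on the assumption that the conversion looks at all categories rather than only the values present.

```python
import pandas as pd

# Hypothetical stand-in data; not the values from the report.
s = pd.Series(["1001", "1002", "1003"], dtype="category")

# A one-row slice still carries every category of the parent series.
one_row = s.iloc[:1]
print(list(one_row.cat.categories))  # → ['1001', '1002', '1003']

# After trimming, only the category that is actually used remains.
trimmed = one_row.cat.remove_unused_categories()
print(list(trimmed.cat.categories))  # → ['1001']
```

If that assumption holds, a problematic entry anywhere in the category list could affect even a one-row slice, which would be consistent with the behaviour described above.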
Expected Behavior
I can't understand what type arrow is trying to infer here, or why this expected type changes when I remove the categories for which there is no error.
Installed Versions
INSTALLED VERSIONS
commit : 4bfe3d0
python : 3.9.7.final.0
python-bits : 64
OS : Linux
OS-release : 5.13.0-1021-aws
Version : #23~20.04.2-Ubuntu SMP Thu Mar 31 11:36:15 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.4.2
numpy : 1.19.5
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 58.0.4
Cython : None
pytest : 7.1.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 7.32.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli :
fastparquet : 0.8.1
fsspec : 2021.10.1
gcsfs : None
markupsafe : 2.0.1
matplotlib : 3.3.4
numba : None
numexpr : 2.8.1
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 7.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.0
snappy : None
sqlalchemy : 1.4.29
tables : 3.7.0
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None