Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
# Happy to provide a reproducible example if someone can advise on *what* the example should contain, since the issue only seems to occur with my full series of raw data
Issue Description
I receive the following exception when calling df.to_parquet():
ArrowInvalid: ("Could not convert <pyarrow.StringScalar: '50015202101011004687846'> with type pyarrow.lib.StringScalar: did not recognize Python value type when inferring an Arrow data type", 'Conversion failed for column Ticket ID with type category')
This is the problematic column:
The same exception persists if I manually try to convert that whole column using pa.array():
ArrowInvalid: Could not convert <pyarrow.StringScalar: '50015202101011004687846'> with type pyarrow.lib.StringScalar: did not recognize Python value type when inferring an Arrow data type
Trying the same with a length-1 series containing the value from the exception results in exactly the same exception:
pa.array(df_full.loc[[15247763],"Ticket ID"])
but the conversion passes successfully if I remove unused categories through pandas beforehand:
pa.array(df_full.loc[[15247763],"Ticket ID"].cat.remove_unused_categories())
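For anyone hitting the same error, the workaround above can be applied to every categorical column before writing. This is only a minimal sketch: the helper name drop_unused_categories and the toy frame are hypothetical, and it has not been confirmed to resolve the exception on the full data set.

```python
import pandas as pd


def drop_unused_categories(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with unused categories removed from categorical columns.

    Hypothetical helper for illustration; assumes unique column names.
    """
    out = df.copy()
    for col in out.columns:
        if isinstance(out[col].dtype, pd.CategoricalDtype):
            out[col] = out[col].cat.remove_unused_categories()
    return out


# Toy stand-in for the real data: one used category and two unused ones.
df = pd.DataFrame(
    {"Ticket ID": pd.Categorical(["a"], categories=["a", "b", "c"])}
)
cleaned = drop_unused_categories(df)
```

With the real data this would be used as drop_unused_categories(df_full).to_parquet(...). Note that it hides the offending categories rather than explaining why Arrow rejects them.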
Expected Behavior
I can't understand what type Arrow is trying to infer here, or why this expected type changes when I remove the categories for which there is no error.
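One way to narrow this down might be to look at the Python types of the category values themselves, since the error message mentions a pyarrow.StringScalar rather than a plain str. The sketch below is an assumption about where to look, not a confirmed diagnosis; category_type_counts is a hypothetical helper and the sample series is made up.

```python
import pandas as pd


def category_type_counts(series: pd.Series) -> dict:
    """Count the Python type names of a categorical Series' categories.

    Hypothetical diagnostic helper; includes unused categories as well.
    """
    counts = {}
    for value in series.cat.categories:
        name = type(value).__name__
        counts[name] = counts.get(name, 0) + 1
    return counts


# Made-up example: the categories mix a str with a non-str object.
s = pd.Series(pd.Categorical(["a", "a"], categories=["a", 1]))
type_counts = category_type_counts(s)
```

Running category_type_counts(df_full["Ticket ID"]) on the real column would show whether any categories are something other than str, which could explain why removing unused categories changes the outcome.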
Installed Versions
INSTALLED VERSIONS
commit : 4bfe3d0
python : 3.9.7.final.0
python-bits : 64
OS : Linux
OS-release : 5.13.0-1021-aws
Version : #23~20.04.2-Ubuntu SMP Thu Mar 31 11:36:15 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.4.2
numpy : 1.19.5
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 58.0.4
Cython : None
pytest : 7.1.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 7.32.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli :
fastparquet : 0.8.1
fsspec : 2021.10.1
gcsfs : None
markupsafe : 2.0.1
matplotlib : 3.3.4
numba : None
numexpr : 2.8.1
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : 7.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.0
snappy : None
sqlalchemy : 1.4.29
tables : 3.7.0
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None