BUG: pandas 1.1.0 MemoryError using .astype("string") which worked using pandas 1.0.5 #35499
Comments
Thanks @ldacey for the report. I can confirm this was working in 1.0.5. This change in behaviour is due to #33465 cc @topper-123
The problem is the array conversion. Your example can be boiled down further to:

```python
>>> data = ["a" * 131880] * 201368
>>> data = np.array(data, dtype=object)
>>> np.asarray(data, str)
MemoryError: Unable to allocate 98.9 GiB for an array with shape (201368,) and data type <U131880
```

The problem is we go from a small list (memory-wise) to a large array, because numpy's fixed-width string dtype reserves space for the longest string in every element. The call to `np.asarray(data, str)` is used to ensure non-string elements get converted to str:

```python
>>> data = [0] + ["a" * 131880] * 201368  # data[0] is not a string
>>> data = np.array(data, dtype=object)
>>> data = np.asarray(data, str)  # used to ensure data[0] converts to str
MemoryError: Unable to allocate 98.9 GiB for an array with shape (201369,) and data type <U131880
```

Is there a way to avoid this/make it more efficient? I've got a feeling that this is pushing against the limits of what numpy can do with strings.

EDIT: added an example of why `asarray(data, str)` is used.
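For scale, a quick check of why the object array is cheap while the fixed-width one is not (8-byte pointers on a 64-bit build; numpy stores unicode as 4-byte UCS-4):

```python
import numpy as np

# The object array holds 201368 eight-byte pointers, all referring to
# the same Python string, so it stays small...
data = np.array(["a" * 131880] * 201368, dtype=object)
print(data.nbytes)  # 1610944 bytes, ~1.5 MiB

# ...whereas np.asarray(data, str) must materialise a fixed-width
# <U131880 array: 131880 UCS-4 code points reserved for every element.
```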
@ldacey, an alternative for your use case could be to use `StringArray` directly, which bypasses the conversion, i.e.:
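A minimal sketch of that approach, assuming every element is already a Python string:

```python
import numpy as np
import pandas as pd

data = np.array(["a" * 131880] * 201368, dtype=object)

# StringArray only validates and wraps the object array; no fixed-width
# <U131880 buffer is allocated.
arr = pd.arrays.StringArray(data)
s = pd.Series(arr)
```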
Just as a note: I would not expect converting a list of many references to the same object into a StringArray to necessarily preserve that sharing once it's converted to StringArray's internal storage. If/when we use Arrow for storing string memory, that wouldn't be possible.
That is probably not what the OP is doing; I used that as a reproducible code sample. I think the issue is that the memory needed is proportional to the number of rows times the longest string (the rows could all be different strings).
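That proportionality reproduces the figure in the error message exactly (numpy unicode arrays use 4 bytes per code point):

```python
rows, longest = 201368, 131880
print(rows * longest * 4 / 2**30)  # 98.93 -> "Unable to allocate 98.9 GiB"
```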
As far as what I am doing: my current code has a pyarrow_schema that I defined. For any column I declared as pa.string(), my script would convert the data with .astype("string"). I tried to be explicit with these types because I faced issues with inconsistent schemas when reading data from the parquet files downstream (related to null data for the most part).

For now, I have switched back to .astype(str), since that runs on the current version of pandas for this data.
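A hedged sketch of that workflow (the schema and column names here are hypothetical stand-ins, not the OP's actual code):

```python
import pandas as pd
import pyarrow as pa

pyarrow_schema = pa.schema([("comment", pa.string()), ("ticket_id", pa.int64())])
df = pd.DataFrame({"comment": ["<div>...</div>", None], "ticket_id": [1, 2]})

# Cast every column declared as pa.string() to pandas' "string" dtype
# so the parquet schema stays consistent even for all-null columns.
for field in pyarrow_schema:
    if field.type == pa.string():
        df[field.name] = df[field.name].astype("string")
```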
Maybe a more representative MRE:
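A sketch of what such an MRE might look like, reconstructed from the follow-up comment below (a mostly-null column with a single long string):

```python
>>> data = np.full(201368, fill_value=np.nan, dtype=object)
>>> data[0] = "a" * 131880
>>> pd.Series(data).astype("string")  # MemoryError on pandas 1.1.0
```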
Yeah I agree, but the current (simplified) conversion chain goes through `np.asarray(data, str)` (as shown above), which is where the huge fixed-width allocation happens.
@simonjayhawkins, both your examples fail with `dtype=str` as well:

```python
>>> data = np.full(201368, fill_value=np.nan, dtype=object)
>>> data[0] = "a" * 131880
>>> df = pd.DataFrame(data, dtype=str)
MemoryError: Unable to allocate 98.9 GiB for an array with shape (201368,) and data type <U131880
```

In some sense, that `dtype="string"` worked in v1.0 was "just luck", and we should fix this problem for both the `str` and `"string"` dtypes.
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
I tried to pinpoint the specific row which causes the error. The column has HTML data like this:
```html
<div class="comment" dir="auto"><p dir="auto">Request <a href="/agent/tickets/test" target="_blank" rel="ticket">#test</a> "RE: ** M ..." Last comment in request </a>:</p> <p dir="auto"></p> <p dir="auto">Thank you</p> <p dir="auto">Stuff.</p> <p dir="auto">We will keep you posted .</p> <p dir="auto">Regards,</p> <p dir="auto">Name</p></div>
```
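A minimal sketch of the failing conversion (the column name and toy data here are hypothetical stand-ins for the real 201368-row frame):

```python
import numpy as np
import pandas as pd

# Stand-in: a sparse column of long HTML strings with many nulls.
df = pd.DataFrame({"comment": [np.nan] * 10 + ["<div>...</div>" * 1000]})

df["comment"].astype(str)       # works on pandas 1.1.0
df["comment"].astype("string")  # raises MemoryError at the real scale
```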
Problem description
I have code which has been converting columns to the "string" dtype, and this worked up until pandas 1.1.0.

For example, I tried to process a file which I successfully processed in April: it works when I use .astype(str), but it fails when I use .astype("string"), even though this worked in pandas 1.0.5.

The column does not need to be the new "string" type, but I wanted to raise this issue anyway.
Rows: 201368
Empty/null rows for the column in question: 189014 / 201368

So this column is quite sparse, and as I mentioned below, if I filter out the nulls and then call .astype("string"), it runs fine (see the sketch below). I am not sure why this worked before (same server, 64 GB of RAM); this file was previously processed as "string" before the update.
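A hedged sketch of that workaround (hypothetical column name):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"comment": [np.nan, "<div>...</div>", np.nan]})

# With the nulls filtered out, every element is already a Python str,
# so (per the discussion above) the fixed-width numpy conversion that
# triggers the MemoryError appears to be skipped.
nonnull = df.loc[df["comment"].notna(), "comment"].astype("string")
```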
Error:
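The MemoryError, as quoted in the discussion above:

```
MemoryError: Unable to allocate 98.9 GiB for an array with shape (201368,) and data type <U131880
```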
Expected Output
The column converts to the "string" dtype without a MemoryError, as it did on pandas 1.0.5.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : d9fff27
python : 3.7.4.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-1028-azure
Version : #29~18.04.1-Ubuntu SMP Fri Jun 5 14:32:34 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.0
numpy : 1.17.5
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0
Cython : 0.29.14
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : 4.8.2
bottleneck : None
fsspec : 0.6.2
fastparquet : None
gcsfs : None
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.4
pandas_gbq : None
pyarrow : 1.0.0
pytables : None
pyxlsb : 1.0.6
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : None
tabulate : 0.8.6
xarray : None
xlrd : 1.2.0
xlwt : None
numba : 0.48.0