
BUG: pandas 1.1.0 MemoryError using .astype("string") which worked using pandas 1.0.5 #35499


Closed
2 of 3 tasks
ldacey opened this issue Jul 31, 2020 · 9 comments · Fixed by #35519
Labels
Bug · Performance (Memory or execution speed performance) · Regression (Functionality that used to work in a prior pandas version) · Strings (String extension data type and string data)
Milestone
1.1.1

Comments

@ldacey

ldacey commented Jul 31, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

I tried to pinpoint the specific row which causes the error. The column has HTML data like this:

'''<div class="comment" dir="auto"><p dir="auto">Request <a href="/agent/tickets/test" target="_blank" rel="ticket">#test</a> "RE: ** M ..." Last comment in request </a>:</p> <p dir="auto"></p> <p dir="auto">Thank you</p> <p dir="auto">Stuff.</p> <p dir="auto">We will keep you posted .</p> <p dir="auto">Regards,</p> <p dir="auto">Name</p></div>'''

# This fails (memory error below):
df['event_html_body'].astype("string")

# Filtering the dataframe to only the rows that are not null for this field works:
x = df[~df.event_html_body.isnull()][['event_html_body']]
x['event_html_body'].astype("string")

# Filling NAs with another value also fails:
df['event_html_body'].fillna('-').astype("string")

Problem description

I have code that converts columns to the "string" dtype, and this worked up until pandas 1.1.0.

For example, I tried to reprocess a file that I successfully processed in April: it works when I use .astype(str), but it fails when I use .astype("string"), even though this worked in pandas 1.0.5.

The column does not strictly need to be the new "string" type, but I wanted to raise this issue anyway.

Rows: 201368
Empty/null rows for the column in question: 189014 / 201368

So this column is quite sparse, and as shown in the code sample above (and the sketch below), if I filter out the nulls first and then call .astype("string"), it runs fine. I am not sure why this worked before (same server, 64 GB of RAM); this file was previously processed as "string" before the update.
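A possible workaround along those lines (a sketch only, not tested against 1.1.0; it assumes the column from the example above) is to build a string-dtype column of <NA> and fill in just the populated rows, so the mostly-null column is never cast in one shot:

# Sketch: cast only the populated rows; nulls stay as <NA>.
mask = df['event_html_body'].notna()
out = pd.Series(pd.NA, index=df.index, dtype="string")
out.loc[mask] = df.loc[mask, 'event_html_body'].astype("string")
df['event_html_body'] = out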

Error:


MemoryError                               Traceback (most recent call last)
<ipython-input-38-939f88862e64> in <module>
----> 1 df['event_html_body'].astype("string")

/opt/conda/lib/python3.7/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors)
   5535         else:
   5536             # else, only a single dtype is given
-> 5537             new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors,)
   5538             return self._constructor(new_data).__finalize__(self, method="astype")
   5539 

/opt/conda/lib/python3.7/site-packages/pandas/core/internals/managers.py in astype(self, dtype, copy, errors)
    565         self, dtype, copy: bool = False, errors: str = "raise"
    566     ) -> "BlockManager":
--> 567         return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
    568 
    569     def convert(

/opt/conda/lib/python3.7/site-packages/pandas/core/internals/managers.py in apply(self, f, align_keys, **kwargs)
    394                 applied = b.apply(f, **kwargs)
    395             else:
--> 396                 applied = getattr(b, f)(**kwargs)
    397             result_blocks = _extend_blocks(applied, result_blocks)
    398 

/opt/conda/lib/python3.7/site-packages/pandas/core/internals/blocks.py in astype(self, dtype, copy, errors)
    588             vals1d = values.ravel()
    589             try:
--> 590                 values = astype_nansafe(vals1d, dtype, copy=True)
    591             except (ValueError, TypeError):
    592                 # e.g. astype_nansafe can fail on object-dtype of strings

/opt/conda/lib/python3.7/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy, skipna)
    911     # dispatch on extension dtype if needed
    912     if is_extension_array_dtype(dtype):
--> 913         return dtype.construct_array_type()._from_sequence(arr, dtype=dtype, copy=copy)
    914 
    915     if not isinstance(dtype, np.dtype):

/opt/conda/lib/python3.7/site-packages/pandas/core/arrays/string_.py in _from_sequence(cls, scalars, dtype, copy)
    215 
    216         # convert to str, then to object to avoid dtype like '<U3', then insert na_value
--> 217         result = np.asarray(result, dtype=str)
    218         result = np.asarray(result, dtype="object")
    219         if has_nans:

/opt/conda/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     83 
     84     """
---> 85     return array(a, dtype, copy=False, order=order)
     86 
     87 

MemoryError: Unable to allocate array with shape (201368,) and data type <U131880
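(For context on the error: <U131880 is a fixed-width NumPy Unicode dtype, so the target array reserves 131,880 code points at 4 bytes each for every one of the 201,368 rows, regardless of how short or null each value actually is:)

>>> 201368 * 131880 * 4 / 2**30   # rows x widest element x 4 bytes per code point
98.93...                          # matches the ~98.9 GiB figure reported below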

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

commit : d9fff27
python : 3.7.4.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-1028-azure
Version : #29~18.04.1-Ubuntu SMP Fri Jun 5 14:32:34 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.0
numpy : 1.17.5
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0
Cython : 0.29.14
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : 4.8.2
bottleneck : None
fsspec : 0.6.2
fastparquet : None
gcsfs : None
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.4
pandas_gbq : None
pyarrow : 1.0.0
pytables : None
pyxlsb : 1.0.6
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : None
tabulate : 0.8.6
xarray : None
xlrd : 1.2.0
xlwt : None
numba : 0.48.0

@ldacey ldacey added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 31, 2020
@simonjayhawkins
Member

Thanks @ldacey for the report. I can confirm this was working in 1.0.5. This change in behaviour is due to #33465 cc @topper-123

>>> print(pd.__version__)
1.1.0.dev0+1676.gb6ea970f8
>>>
>>> data = "a" * 131880
>>>
>>> res = pd.DataFrame([data] * 201368, dtype="string")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\simon\pandas\pandas\core\frame.py", line 515, in __init__
    mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy)
  File "C:\Users\simon\pandas\pandas\core\internals\construction.py", line 186, in init_ndarray
    return arrays_to_mgr(values, columns, index, columns, dtype=dtype)
  File "C:\Users\simon\pandas\pandas\core\internals\construction.py", line 83, in arrays_to_mgr
    arrays = _homogenize(arrays, index, dtype)
  File "C:\Users\simon\pandas\pandas\core\internals\construction.py", line 351, in _homogenize
    val = sanitize_array(
  File "C:\Users\simon\pandas\pandas\core\construction.py", line 441, in sanitize_array
    subarr = _try_cast(data, dtype, copy, raise_cast_failure)
  File "C:\Users\simon\pandas\pandas\core\construction.py", line 542, in _try_cast
    subarr = array_type(arr, dtype=dtype, copy=copy)
  File "C:\Users\simon\pandas\pandas\core\arrays\string_.py", line 218, in _from_sequence
    result = np.asarray(result, dtype=str)
  File "C:\Users\simon\Anaconda3\envs\pandas-dev\lib\site-packages\numpy\core\_asarray.py", line 83, in asarray
    return array(a, dtype, copy=False, order=order)
MemoryError: Unable to allocate 98.9 GiB for an array with shape (201368,) and data type <U131880
>>>
>>> print(pd.__version__)
1.1.0.dev0+1675.g8c7d653a4
>>>
>>> data = "a" * 131880
>>>
>>> res = pd.DataFrame([data] * 201368, dtype="string")
>>> print(res)
                                                        0
0       aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
1       aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
2       aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
3       aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
4       aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
...                                                   ...
201363  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
201364  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
201365  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
201366  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
201367  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...

[201368 rows x 1 columns]
>>>

@simonjayhawkins simonjayhawkins added Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 1, 2020
@simonjayhawkins simonjayhawkins added this to the 1.1.1 milestone Aug 1, 2020
@topper-123
Contributor

topper-123 commented Aug 1, 2020

The problem is the array conversion. Your example can be boiled down further to:

>>> data = ["a" * 131880] * 201368
>>> data = np.array(data, dtype=object)
>>> np.asarray(data, str)
MemoryError: Unable to allocate 98.9 GiB for an array with shape (201368,) and data type <U131880

The problem is the memory blow-up in the conversion: the input is small (data[0] is data[1] etc., so the list holds 201368 references to a single string), but casting to a fixed-width NumPy string dtype materializes a full-width copy for every element.
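To make the asymmetry concrete (a small illustration; exact sizes are CPython-specific):

>>> import sys
>>> data = ["a" * 131880] * 201368
>>> sys.getsizeof(data[0])        # one shared ~132 kB string object
131929
>>> arr = np.array(data, dtype=object)
>>> arr[0] is arr[1]              # the object array still holds 201368 pointers to it
True
>>> # np.asarray(arr, str) would instead allocate 201368 full-width slots -> ~98.9 GiB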

The call to np.asarray(result, str) is there to ensure all scalars are strings, so we don't end up with stray ints or floats. For example:

>>> data = [0] + ["a" * 131880] * 201368  # data[0] is not a string
>>> data = np.array(data, dtype=object)
>>> data = np.asarray(data, str)  # used to ensure data[0] converts to str
MemoryError: Unable to allocate 98.9 GiB for an array with shape (201369,) and data type <U131880

Is there a way to avoid this / make it more efficient? I've got a feeling that this is pushing against the limits of what numpy can do with strings.

EDIT: added an example of why asarray(data, str) is used.

@topper-123
Contributor

@ldacey, an alternative for your use case could be to use StringArray, which bypasses the conversion, i.e.:

>>> data = np.array(["a" * 131880] * 201368, dtype=object)
>>> data = pd.arrays.StringArray(data)
>>> pd.Series(data)
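(StringArray expects an object-dtype NumPy array of strings or pd.NA, and it validates the contents rather than converting them, which is how it sidesteps the fixed-width allocation.)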

@simonjayhawkins simonjayhawkins added the Strings String extension data type and string data label Aug 3, 2020
@TomAugspurger
Contributor

Just as a note: I would not expect converting a list of many references to the same object to a StringArray to necessarily preserve that fact once it's converted to StringArray's internal storage. If / when we use Arrow for storing string memory that wouldn't be possible.

@simonjayhawkins
Member

That is probably not what the OP is doing; I used that as a reproducible code sample. I think the issue is that the memory needed is proportional to the number of rows times the longest string (all rows could be different).

@ldacey
Author

ldacey commented Aug 3, 2020

As far as what I am doing:

My current code has a pyarrow_schema that I defined. For any column I declared as pa.string(), my script would convert the data to .astype("string"). I tried to be explicit with these types because I faced issues with inconsistent schemas when reading data from the parquet files downstream (related to null data for the most part).

pyarrow_schema = pa.schema(
    [
        ("ticket_id", pa.string()),
        ("solved_at", pa.timestamp(unit="ns")),
        ("closed_at", pa.timestamp(unit="ns")),
        ("event_subject", pa.string()),
        ("event_html_body", pa.string()),
    ]
)
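For reference, a minimal sketch of how such a schema can be applied when writing parquet (a sketch of the general approach; df and the file name here are placeholders):

import pyarrow as pa
import pyarrow.parquet as pq

# Coerce columns to the declared schema instead of letting pyarrow infer types,
# so files written from sparse/null chunks still share one schema downstream.
table = pa.Table.from_pandas(df, schema=pyarrow_schema, preserve_index=False)
pq.write_table(table, "events.parquet")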

For now, I have switched back to .astype(str), since that runs on the current version of pandas for this data.

@simonjayhawkins
Member

Maybe a more representative MRE:

>>> pd.__version__
'1.2.0.dev0+17.ga0c8425a5'
>>>
>>> data = np.full(201368, fill_value=np.nan, dtype=object)
>>>
>>> data[0] = "a" * 131880
>>> df = pd.DataFrame(data)
>>>
>>> df.astype("string")
Traceback (most recent call last):
...
MemoryError: Unable to allocate 98.9 GiB for an array with shape (201368,) and data type <U131880
>>>
>>> pd.__version__
'1.0.5'
>>>
>>> data = np.full(201368, fill_value=np.nan, dtype=object)
>>>
>>> data[0] = "a" * 131880
>>> df = pd.DataFrame(data)
>>>
>>> df.astype("string")
                                                        0
0       aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
1                                                    <NA>
2                                                    <NA>
3                                                    <NA>
4                                                    <NA>
...                                                   ...
201363                                               <NA>
201364                                               <NA>
201365                                               <NA>
201366                                               <NA>
201367                                               <NA>

[201368 rows x 1 columns]
>>>

@topper-123
Contributor

topper-123 commented Aug 3, 2020

> Just as a note: I would not expect converting a list of many references to the same object to a StringArray to necessarily preserve that fact once it's converted to StringArray's internal storage. If / when we use Arrow for storing string memory that wouldn't be possible.

Yeah, I agree, but the current (simplified) conversion chain np.asarray(data, dtype=object).astype(str).astype(object) seems wasteful compared to just keeping existing strings where possible, which is safe because strings are immutable. Ideally, only non-Python-string scalars would be converted.
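A pure-Python sketch of that idea (illustrative only, not the pandas implementation; a real fix would also need NA handling and would live at a lower level for speed):

import numpy as np

def ensure_str_objects(values):
    # Element-wise conversion into an object array: existing Python str
    # objects are kept as-is (safe, since strings are immutable), so memory
    # scales with the distinct strings rather than rows * longest string.
    # NA handling is omitted here for brevity.
    result = np.empty(len(values), dtype=object)
    for i, val in enumerate(values):
        result[i] = val if isinstance(val, str) else str(val)
    return result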

@topper-123
Contributor

@simonjayhawkins, both your examples fail with dtype=str in v1.0:

>>> data = np.full(201368, fill_value=np.nan, dtype=object)
>>>
>>> data[0] = "a" * 131880
>>> df = pd.DataFrame(data, dtype=str)
MemoryError: Unable to allocate 98.9 GiB for an array with shape (201368,) and data type <U131880

In some sense, the fact that dtype="string" worked in v1.0 was "just luck", and we should fix this problem in both the dtype=str case and the dtype="string" case.
