MemoryError with more than 1E9 rows #8252

Closed
mattdowle opened this issue Sep 12, 2014 · 3 comments · Fixed by #8331

Labels: Performance, Reshaping

@mattdowle

I have 240GB of RAM and nothing else running on the machine. I'm trying to create 1.5E9 rows, which I think should produce a data frame of around 100GB, but I'm getting this MemoryError. It works fine with 1E9 rows but not 1.5E9. I could understand a limit at about 2^31 (2E9) or 2^32 (4E9), but all 240GB seems exhausted (according to htop) somewhere between 1E9 and 1.5E9 rows. Any ideas? Thanks.
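
A back-of-the-envelope check of that ~100GB figure (a sketch assuming the six int/float columns cost 8 bytes per row and the three object columns hold 8-byte pointers into a small shared pool of strings, so the string payload itself is negligible):

N = int(1.5e9)
num_numeric = 6    # id4, id5, id6, v1, v2, v3: int64/float64, 8 bytes per row
num_object  = 3    # id1, id2, id3: 8-byte pointers to shared string objects
print((num_numeric + num_object) * 8 * N / 1e9)   # ~108 GB for the final frame

The intermediate Python lists returned by randChar/randFloat, plus any extra copy taken inside the constructor, sit on top of that, which is how 240GB can be exhausted.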

$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> import numpy as np
>>> import timeit
>>> pd.__version__
'0.14.1'
>>> def randChar(f, numGrp, N) :
...    things = [f%x for x in range(numGrp)]
...    return [things[x] for x in np.random.choice(numGrp, N)]
... 
>>> def randFloat(numGrp, N) :
...    things = [round(100*np.random.random(),4) for x in range(numGrp)]
...    return [things[x] for x in np.random.choice(numGrp, N)]
... 
>>> N=int(1.5e9)       # N=int(1e9) works fine
>>> K=100
>>> DF = pd.DataFrame({
...   'id1' : randChar("id%03d", K, N),       # large groups (char)
...   'id2' : randChar("id%03d", K, N),       # large groups (char)
...   'id3' : randChar("id%010d", N//K, N),   # small groups (char)
...   'id4' : np.random.choice(K, N),         # large groups (int)
...   'id5' : np.random.choice(K, N),         # large groups (int)
...   'id6' : np.random.choice(N//K, N),      # small groups (int)
...   'v1' :  np.random.choice(5, N),         # int in [0,5)
...   'v2' :  np.random.choice(5, N),         # int in [0,5)
...   'v3' :  randFloat(100,N)                # numeric e.g. 23.5749
... })
Traceback (most recent call last):
  File "<stdin>", line 10, in <module>
  File "/usr/lib/python3/dist-packages/pandas/core/frame.py", line 203, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/usr/lib/python3/dist-packages/pandas/core/frame.py", line 327, in _init_dict
    dtype=dtype)
  File "/usr/lib/python3/dist-packages/pandas/core/frame.py", line 4630, in _arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
  File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 3235, in create_block_manager_from_arrays
    blocks = form_blocks(arrays, names, axes)
  File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 3322, in form_blocks
    object_items, np.object_)
  File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 3346, in _simple_blockify
    values, placement = _stack_arrays(tuples, dtype)
  File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 3410, in _stack_arrays
    stacked = np.empty(shape, dtype=dtype)
MemoryError
$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Stepping:              4
CPU MHz:               2494.070
BogoMIPS:              5054.21
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31
$ free -h
             total       used       free     shared    buffers     cached
Mem:          240G       2.3G       237G       364K        66M       632M
-/+ buffers/cache:       1.6G       238G
Swap:           0B         0B         0B
$

An earlier question on Stack Overflow is here: http://stackoverflow.com/questions/25631076/is-this-the-fastest-way-to-group-in-pandas

@jreback (Contributor) commented Sep 12, 2014

You can try creating each column as a Series first, then putting them into a dict and creating the frame from that. However, you might be having a problem finding contiguous memory.
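
A minimal sketch of that suggestion (build_frame is a hypothetical name; only two of the nine columns are shown, the rest follow the same pattern):

import numpy as np
import pandas as pd

def build_frame(K, N):
    # Build each column as its own Series, one at a time, so every
    # intermediate can be garbage-collected before the next is allocated.
    columns = {}
    columns['id4'] = pd.Series(np.random.choice(K, N))
    columns['v1']  = pd.Series(np.random.choice(5, N))
    # ... remaining columns built the same way ...
    return pd.DataFrame(columns)

Note that in 0.14.1 the constructor can still copy when it consolidates same-dtype columns into blocks (the np.empty call in the traceback above), so this mainly helps by not holding all the raw inputs alive at once.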

@jreback (Contributor) commented Sep 20, 2014

@mattdowle

Finally had time to look at this. I think there was an extra copy going on in certain cases.

So try this out using master (once I merge this change); it seems to scale much better. Use the following slightly modified code:

# Your original routines used lots of extra memory because they created many Python objects.
def randChar(f, num_group, N):
    things = np.array([f%x for x in range(num_group)])
    return things.take(np.random.choice(num_group,N)).astype('object')

def randFloat(num_group, N):
    things = (np.random.randn(num_group)*100).round(4)
    return things.take(np.random.choice(num_group,N))

def f4(K, N):
    objects = pd.DataFrame({'id1' : randChar("id%03d", K, N),      # large groups (char)
                            'id2' : randChar("id%03d", K, N),      # large groups (char)
                            'id3' : randChar("id%010d", N//K, N)   # small groups (char)
                            })
    ints = pd.DataFrame({ 'id4' : np.random.choice(K, N),         # large groups (int)
                          'id5' : np.random.choice(K, N),         # large groups (int)
                          'id6' : np.random.choice(N//K, N),      # small groups (int)
                          'v1' : np.random.choice(5, N),         # int in [0,5)
                          'v2' : np.random.choice(5, N)          # int in [0,5)
                          })
    floats = pd.DataFrame({ 'v3' : randFloat(100,N) })               # numeric e.g. 23.5749

    return pd.concat([objects,ints,floats],axis=1,copy=False)
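
For what it's worth, a small smoke run before committing to the full 1.5e9 rows (sizes here are arbitrary):

df = f4(100, int(1e6))    # tiny test first
print(df.dtypes)
print(len(df))
# then scale up on the big box: df = f4(100, int(1.5e9))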

@jreback (Contributor) commented Sep 21, 2014

@mattdowle I updated the example to a pretty simplified version that gives good memory performance (it peaks at just a bit over 1x the final data size) by not trying to create everything at once.
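
A quick way to check that peak-vs-final claim on the result (DataFrame.memory_usage is not in the 0.14.1 shown above; it arrived in a later pandas release):

df = f4(100, int(1e6))
print(df.memory_usage(index=True).sum() / 1e6, 'MB')   # numeric columns + index
print(df.memory_usage(deep=True).sum() / 1e6, 'MB')    # also counts string payloads
# compare against the process peak reported by htop while f4 runs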
