OSError: [Errno 28] No space left on device #277

Open
1 task done
DTDwind opened this issue Oct 23, 2024 · 6 comments

DTDwind commented Oct 23, 2024

General

  • Operating System: Docker (python:3.12-slim)
  • Python version: 3.12.5
  • Pandas version: 2.2.2
  • Pandarallel version: 1.6.5

Acknowledgement

  • My issue is NOT present when using pandas alone (without pandarallel)

Bug description

Observed behavior

When I execute the program, I get "OSError: [Errno 28] No space left on device".

I referred to #127 and added MEMORY_FS_ROOT and JOBLIB_TEMP_FOLDER, but it doesn't work.

This is my code:

import pandas as pd
from pandarallel import pandarallel
import os

os.environ['MEMORY_FS_ROOT'] = "/app/tmp"
os.environ['JOBLIB_TEMP_FOLDER'] = '/app/tmp'

data = {'url': ['https://example.com/1', 'https://example.com/2'],
        'label': [0, 1]}
table = pd.DataFrame(data)

pandarallel.initialize(progress_bar=False, use_memory_fs=False)

table['count.'] = table['url'].parallel_apply(lambda x: x.count('.'))  # parallel_apply instead of apply
table

df -h inside my Docker container:

Filesystem      Size  Used Avail Use% Mounted on
overlay         1.8T  260G  1.5T  15% /
tmpfs            64M     0   64M   0% /dev
shm              64M   64M     0 100% /dev/shm
/dev/nvme1n1    1.8T  260G  1.5T  15% /app
tmpfs            63G     0   63G   0% /proc/asound
tmpfs            63G     0   63G   0% /proc/acpi
tmpfs            63G     0   63G   0% /proc/scsi
tmpfs            63G     0   63G   0% /sys/firmware
tmpfs            63G     0   63G   0% /sys/devices/virtual/powercap

I also tried os.environ['JOBLIB_TEMP_FOLDER'] = '/tmp'.

Can anyone help me?

@highvight

Did you try setting the MEMORY_FS_ROOT env variable before importing pandarallel?

You can check the current location with pandarallel.core.MEMORY_FS_ROOT

DTDwind (Author) commented Oct 24, 2024

Hi @highvight ,

I tried setting the MEMORY_FS_ROOT env variable before importing pandarallel and checked the current location.

The following is my code:

import pandas as pd
import os

os.environ['MEMORY_FS_ROOT'] = "/app/tmp"
os.environ['JOBLIB_TEMP_FOLDER'] = '/app/tmp'

import pandarallel
print(pandarallel.core.MEMORY_FS_ROOT) # /app/tmp
pandarallel.pandarallel.initialize(progress_bar=False)  # I also tried (progress_bar=False, use_memory_fs=False)

data = {'url': ['https://example.com/1', 'https://example.com/2'], 'label': [0, 1]}
table = pd.DataFrame(data)
table['count.'] = table['url'].parallel_apply(lambda x: x.count('.'))

MEMORY_FS_ROOT is /app/tmp.

ls -al for /app/tmp
drwxrwxrwx 2 root root 4096 Oct 23 17:32 tmp

ls -al for /app/tmp content
-rw------- 1 root root 637 Oct 24 09:42 pandarallel_input_g3g0gh6k.pickle
-rw------- 1 root root 637 Oct 24 09:42 pandarallel_input_k5bgg22r.pickle

I am still getting "No space left on device", and I don't know why.

Error message:
[screenshot of the OSError: [Errno 28] traceback]

DTDwind (Author) commented Oct 24, 2024

I just confirmed that temporarily clearing /dev/shm allows small amounts of data to pass through the program, so it seems my modification had no effect.

I tried modifying core.py, but it still doesn't work.

core.py
33| # MEMORY_FS_ROOT = os.environ.get("MEMORY_FS_ROOT", "/dev/shm")
34| MEMORY_FS_ROOT = "/app/tmp"

@usama3162

@DTDwind Try setting use_memory_fs=False.

Note: MEMORY_FS_ROOT is only applied when use_memory_fs is set to True.
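
For reference, a minimal sketch combining this suggestion with the import-order advice above (the /app/tmp path is carried over from earlier comments and assumes that directory exists inside the container):

import os

# Both variables must be set before pandarallel is imported.
os.environ['MEMORY_FS_ROOT'] = '/app/tmp'      # only consulted when use_memory_fs=True
os.environ['JOBLIB_TEMP_FOLDER'] = '/app/tmp'

import pandas as pd
from pandarallel import pandarallel

# With use_memory_fs=False, data goes to the workers via standard
# multiprocessing transfer instead of pickle files under MEMORY_FS_ROOT
# (which defaults to /dev/shm).
pandarallel.initialize(progress_bar=False, use_memory_fs=False)

table = pd.DataFrame({'url': ['https://example.com/1', 'https://example.com/2']})
table['count.'] = table['url'].parallel_apply(lambda x: x.count('.'))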

chris-aeviator commented Nov 9, 2024

I'm running into the same issue, with no remedy from changing the settings or os.environ. I'm also inside Docker; my dataset is ~3 GB and my RAM is 512 GB.

Looking at possible solutions:

  • Docker gives /dev/shm a laughable 64 MB by default (check with df -h /dev/shm), but it can be raised with --shm-size when starting the container, e.g. docker run --shm-size=1gb ...
  • looking at DTDwind's reimplementation, maybe SpooledTemporaryFile or MemoryFS could help ease the pain?! (see the sketch after the quote below)
class tempfile.SpooledTemporaryFile(max_size=0, mode='w+b', buffering=-1, encoding=None, newline=None, suffix=None, prefix=None, dir=None, *, errors=None)

This class operates exactly as TemporaryFile() does, except that data is spooled in memory until the file size exceeds max_size, or until the file’s fileno() method is called, at which point the contents are written to disk and operation proceeds as with TemporaryFile().
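
To illustrate that spooling behavior, here is a minimal standard-library sketch (not something pandarallel does today; the /app/tmp directory is an assumption carried over from this thread, and dir= can be dropped to use the default temp directory):

import tempfile

with tempfile.SpooledTemporaryFile(max_size=1024 * 1024, dir='/app/tmp') as f:
    f.write(b'small payload')           # below max_size: buffered in RAM
    f.write(b'x' * (2 * 1024 * 1024))   # exceeds max_size: rolled over to a real file in dir
    f.seek(0)
    print(len(f.read()))                # reads back transparently either way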

Update:

I can confirm that running pandarallel outside of Docker on the same machine does not error. It wants to consume huge amounts of RAM, though. I'm using 36 workers (auto-selected; I also want max speed), my dataset is 3 GB, and RAM consumption rises to >260 GB (more than 2x what it would take even if every worker held a full copy of the dataset).

DTDwind (Author) commented Nov 26, 2024

I used parallel_pandas instead, and it works well in the Docker environment.
Here is a simple example:

pip install --upgrade parallel-pandas

import pandas as pd
from parallel_pandas import ParallelPandas

# initialize parallel-pandas
ParallelPandas.initialize(n_cpu=16, split_factor=4, disable_pr_bar=True)

data = {'url': ['https://example.com/1', 'https://example.com/2'],
        'label': [0, 1]}
table = pd.DataFrame(data)
table['count.'] = table['url'].p_apply(lambda x: x.count('.'))
