CUDA error out of memory #2001

Open
1 task
ngreen123 opened this issue Oct 22, 2024 · 8 comments
@ngreen123

CUDA error out of memory despite having 20 GB GPU
We are running a 27K-frame video and receiving error messages saying:
2024-10-21 17:21:44.237412: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-21 17:21:44.237751: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
despite having a 20 GB GPU. We were using smaller videos before, of only about 50 frames, as a trial, and still had these errors pop up (although only a few compared to the ~1000 we are getting now). The video finally finishes, but a type 1 error occurs and the frames are not labeled. We've noticed that if we don't analyze the last frame of the video, we still get the memory errors, but the type 1 error doesn't happen.

Expected behaviour

Actual behaviour

Your personal set up

Windows 11, Intel(R) Xeon(R) w3-2423 processor, NVIDIA RTX 4000 Ada Generation 20 GB

Environment packages

packages in environment at C:\Users\Sahay\anaconda3\envs\sleap:

Name Version Build Channel

absl-py 1.0.0 pypi_0 pypi
astunparse 1.6.3 pypi_0 pypi
attrs 21.4.0 pyhd8ed1ab_0 conda-forge
backports-zoneinfo 0.2.1 pypi_0 pypi
brotli 1.1.0 hcfcfb64_1 conda-forge
brotli-bin 1.1.0 hcfcfb64_1 conda-forge
ca-certificates 2024.8.30 h56e8100_0 conda-forge
cached-property 1.5.2 pypi_0 pypi
cachetools 4.2.4 pypi_0 pypi
cattrs 1.1.1 pyhd8ed1ab_0 conda-forge
certifi 2024.7.4 pyhd8ed1ab_0 conda-forge
charset-normalizer 2.0.9 pypi_0 pypi
cloudpickle 2.2.1 pyhd8ed1ab_0 conda-forge
cuda-nvcc 11.3.58 hb8d16a4_0 nvidia
cudatoolkit 11.3.1 hf2f0253_13 conda-forge
cudnn 8.2.1.32 h754d62a_0 conda-forge
cycler 0.11.0 pyhd8ed1ab_0 conda-forge
cytoolz 0.12.0 py37hcc03f2d_0 conda-forge
dask-core 2022.2.0 pyhd8ed1ab_0 conda-forge
efficientnet 1.0.0 pypi_0 pypi
flatbuffers 2.0 pypi_0 pypi
fonttools 4.38.0 py37h51bd9d9_0 conda-forge
freeglut 3.2.2 he0c23c2_3 conda-forge
freetype 2.12.1 hdaf720e_2 conda-forge
fsspec 2023.1.0 pyhd8ed1ab_0 conda-forge
gast 0.4.0 pypi_0 pypi
geos 3.11.0 h39d44d4_0 conda-forge
google-auth 2.3.3 pypi_0 pypi
google-auth-oauthlib 0.4.6 pypi_0 pypi
google-pasta 0.2.0 pypi_0 pypi
grpcio 1.43.0 pypi_0 pypi
h5py 3.1.0 pypi_0 pypi
hdmf 3.6.1 pypi_0 pypi
icu 69.1 h0e60522_0 conda-forge
idna 3.3 pypi_0 pypi
image-classifiers 1.0.0 pypi_0 pypi
imagecodecs-lite 2019.12.3 py37h0b711f8_5 conda-forge
imageio 2.35.1 pyh12aca89_0 conda-forge
imgaug 0.4.0 pyhd8ed1ab_1 conda-forge
imgstore 0.2.9 pypi_0 pypi
importlib-metadata 4.2.0 pypi_0 pypi
importlib-resources 5.12.0 pypi_0 pypi
intel-openmp 2024.2.1 h57928b3_1083 conda-forge
jasper 2.0.33 hc2e4405_1 conda-forge
joblib 1.3.2 pyhd8ed1ab_0 conda-forge
jpeg 9e hcfcfb64_3 conda-forge
jsmin 3.0.1 pyhd8ed1ab_0 conda-forge
jsonpickle 1.2 py_0 conda-forge
jsonschema 4.17.3 pypi_0 pypi
keras 2.7.0 pypi_0 pypi
keras-applications 1.0.8 pypi_0 pypi
keras-preprocessing 1.1.2 pypi_0 pypi
kiwisolver 1.4.4 py37h8c56517_0 conda-forge
lcms2 2.14 h90d422f_0 conda-forge
lerc 4.0.0 h63175ca_0 conda-forge
libblas 3.9.0 23_win64_mkl conda-forge
libbrotlicommon 1.1.0 hcfcfb64_1 conda-forge
libbrotlidec 1.1.0 hcfcfb64_1 conda-forge
libbrotlienc 1.1.0 hcfcfb64_1 conda-forge
libcblas 3.9.0 23_win64_mkl conda-forge
libclang 12.0.0 pypi_0 pypi
libdeflate 1.14 hcfcfb64_0 conda-forge
libhwloc 2.11.1 default_h8125262_1000 conda-forge
libiconv 1.17 hcfcfb64_2 conda-forge
liblapack 3.9.0 23_win64_mkl conda-forge
liblapacke 3.9.0 23_win64_mkl conda-forge
libopencv 4.5.5 py37h542666b_10 conda-forge
libpng 1.6.43 h19919ed_0 conda-forge
libprotobuf 3.20.3 h12be248_0 conda-forge
libsodium 1.0.18 h8d14728_1 conda-forge
libsqlite 3.46.0 h2466b09_0 conda-forge
libtiff 4.4.0 hc4f729c_5 conda-forge
libwebp-base 1.4.0 hcfcfb64_0 conda-forge
libxcb 1.13 hcd874cb_1004 conda-forge
libxml2 2.12.7 h0f24e4e_4 conda-forge
libxslt 1.1.39 h3df6e99_0 conda-forge
libzlib 1.3.1 h2466b09_1 conda-forge
locket 1.0.0 pyhd8ed1ab_0 conda-forge
m2w64-gcc-libgfortran 5.3.0 6 conda-forge
m2w64-gcc-libs 5.3.0 7 conda-forge
m2w64-gcc-libs-core 5.3.0 7 conda-forge
m2w64-gmp 6.1.0 2 conda-forge
m2w64-libwinpthread-git 5.0.0.4634.697f757 2 conda-forge
markdown 3.3.6 pypi_0 pypi
markdown-it-py 2.2.0 pyhd8ed1ab_0 conda-forge
matplotlib-base 3.5.3 py37hbaab90a_2 conda-forge
mdurl 0.1.2 pyhd8ed1ab_0 conda-forge
mkl 2024.1.0 h66d3029_694 conda-forge
msys2-conda-epoch 20160418 1 conda-forge
munkres 1.1.4 pyh9f0ad1d_0 conda-forge
ndx-pose 0.1.1 pypi_0 pypi
networkx 2.7 pyhd8ed1ab_0 conda-forge
nixio 1.5.3 pypi_0 pypi
numpy 1.19.5 pypi_0 pypi
oauthlib 3.1.1 pypi_0 pypi
opencv 4.5.5 py37h03978a9_10 conda-forge
opencv-python-headless 4.2.0.34 pypi_0 pypi
openjpeg 2.5.0 hc9384bd_1 conda-forge
openssl 1.1.1w hcfcfb64_0 conda-forge
opt-einsum 3.3.0 pypi_0 pypi
packaging 21.3 pypi_0 pypi
pandas 1.3.5 py37h9386db6_0 conda-forge
partd 1.4.1 pyhd8ed1ab_0 conda-forge
patsy 0.5.6 pyhd8ed1ab_0 conda-forge
pillow 9.2.0 py37h42a8222_2 conda-forge
pip 24.0 pyhd8ed1ab_0 conda-forge
pkgutil-resolve-name 1.3.10 pypi_0 pypi
protobuf 3.19.1 pypi_0 pypi
psutil 5.9.3 py37h51bd9d9_0 conda-forge
pthread-stubs 0.4 hcd874cb_1001 conda-forge
pthreads-win32 2.9.1 hfa6e2cd_3 conda-forge
py-opencv 4.5.5 py37h90c5f73_10 conda-forge
pyasn1 0.4.8 pypi_0 pypi
pyasn1-modules 0.2.8 pypi_0 pypi
pygments 2.17.2 pyhd8ed1ab_0 conda-forge
pykalman 0.9.7 pyhd8ed1ab_0 conda-forge
pynwb 2.3.3 pypi_0 pypi
pyparsing 3.0.6 pypi_0 pypi
pyrsistent 0.19.3 pypi_0 pypi
pyside2 5.13.2 py37h760f651_8 conda-forge
python 3.7.12 h7840368_100_cpython conda-forge
python-dateutil 2.9.0 pyhd8ed1ab_0 conda-forge
python-rapidjson 1.9 py37h7f67f24_0 conda-forge
python_abi 3.7 4_cp37m conda-forge
pytz 2024.1 pyhd8ed1ab_0 conda-forge
pywavelets 1.3.0 py37h3a130e4_1 conda-forge
pyyaml 6.0 py37hcc03f2d_4 conda-forge
pyzmq 24.0.1 py37h7347f05_0 conda-forge
qimage2ndarray 1.10.0 pypi_0 pypi
qt 5.12.9 h556501e_6 conda-forge
qtpy 2.4.1 pyhd8ed1ab_0 conda-forge
requests 2.26.0 pypi_0 pypi
requests-oauthlib 1.3.0 pypi_0 pypi
rich 13.7.1 pyhd8ed1ab_0 conda-forge
ruamel-yaml 0.17.32 pypi_0 pypi
ruamel-yaml-clib 0.2.7 pypi_0 pypi
scikit-image 0.19.2 py37h9386db6_0 conda-forge
scikit-learn 1.0 py37ha78be43_1 conda-forge
scikit-video 1.1.11 pyh24bf2e0_0 conda-forge
scipy 1.7.3 py37hb6553fb_0 conda-forge
seaborn 0.12.2 hd8ed1ab_0 conda-forge
seaborn-base 0.12.2 pyhd8ed1ab_0 conda-forge
segmentation-models 1.0.1 pypi_0 pypi
setuptools 59.8.0 py37h03978a9_1 conda-forge
setuptools-scm 6.3.2 pypi_0 pypi
shapely 1.8.5 py37h475e9a0_0 conda-forge
six 1.16.0 pyh6c4a22f_0 conda-forge
sleap 1.3.3 pypi_0 pypi
sqlite 3.46.0 h2466b09_0 conda-forge
statsmodels 0.13.2 py37h3a130e4_0 conda-forge
tbb 2021.12.0 hc790b64_4 conda-forge
tensorboard 2.7.0 pypi_0 pypi
tensorboard-data-server 0.6.1 pypi_0 pypi
tensorboard-plugin-wit 1.8.0 pypi_0 pypi
tensorflow 2.7.0 pypi_0 pypi
tensorflow-estimator 2.7.0 pypi_0 pypi
tensorflow-hub 0.12.0 pyhca92ed8_0 conda-forge
tensorflow-io-gcs-filesystem 0.23.1 pypi_0 pypi
termcolor 1.1.0 pypi_0 pypi
threadpoolctl 3.1.0 pyh8a188c0_0 conda-forge
tifffile 2020.6.3 py_0 conda-forge
tk 8.6.13 h5226925_1 conda-forge
tomli 2.0.0 pypi_0 pypi
toolz 0.12.1 pyhd8ed1ab_0 conda-forge
typing-extensions 4.0.1 pypi_0 pypi
typing_extensions 4.7.1 pyha770c72_0 conda-forge
tzdata 2023.3 pypi_0 pypi
tzlocal 5.0.1 pypi_0 pypi
ucrt 10.0.22621.0 h57928b3_0 conda-forge
unicodedata2 14.0.0 py37hcc03f2d_1 conda-forge
urllib3 1.26.7 pypi_0 pypi
vc 14.3 h8a93ad2_20 conda-forge
vc14_runtime 14.40.33810 hcc2c482_20 conda-forge
vs2015_runtime 14.40.33810 h3bf8584_20 conda-forge
werkzeug 2.0.2 pypi_0 pypi
wheel 0.42.0 pyhd8ed1ab_0 conda-forge
wrapt 1.13.3 pypi_0 pypi
xorg-libxau 1.0.11 hcd874cb_0 conda-forge
xorg-libxdmcp 1.1.3 hcd874cb_0 conda-forge
xz 5.2.6 h8d14728_0 conda-forge
yaml 0.2.5 h8ffe710_2 conda-forge
zeromq 4.3.4 h0e60522_1 conda-forge
zipp 3.15.0 pypi_0 pypi
zstd 1.5.6 h0ea2cb4_0 conda-forge

@ngreen123 added the bug (Something isn't working) label Oct 22, 2024
@ngreen123
Author

Here is more error info:

.
.
.
.\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.220579: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.220654: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.220953: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.221010: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.221090: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.221131: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.221350: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.221396: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.223891: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.223963: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.224118: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.224180: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.224515: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.224578: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.226757: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.226843: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.227054: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.227127: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.227958: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.228039: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.230018: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.230100: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.230259: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.230332: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.230622: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.230681: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.232530: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.232717: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.232889: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.232960: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.233150: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.233211: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.233310: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.233356: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.233521: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.233584: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.233972: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.234031: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.234167: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.234221: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.234473: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.234527: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.234734: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.234789: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.235050: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.235136: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.235500: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.235587: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.235858: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.235932: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.236139: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.236211: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.236368: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.236437: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.236540: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.236596: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.237277: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.237389: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.237596: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.237652: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.237891: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.237952: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.245353: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.245492: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.245675: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.245757: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.422618: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.422793: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.422949: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.423025: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.752297: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.752479: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:17.752635: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:17.752732: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:18.067664: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:18.067830: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:18.068455: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:18.068540: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:18.384663: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:18.384892: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
2024-10-22 09:22:18.385262: E tensorflow/stream_executor/cuda/cuda_driver.cc:802] failed to alloc 34359738368 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2024-10-22 09:22:18.385334: W .\tensorflow/core/common_runtime/device/device_host_allocator.h:46] could not allocate pinned host memory of size: 34359738368
Traceback (most recent call last):
File "C:\Users\Sahay\anaconda3\envs\sleap\Scripts\sleap-track-script.py", line 33, in
sys.exit(load_entry_point('sleap==1.3.3', 'console_scripts', 'sleap-track')())
File "C:\Users\Sahay\anaconda3\envs\sleap\lib\site-packages\sleap\nn\inference.py", line 5424, in main
labels_pr = predictor.predict(provider)
File "C:\Users\Sahay\anaconda3\envs\sleap\lib\site-packages\sleap\nn\inference.py", line 526, in predict
self._make_labeled_frames_from_generator(generator, data)
File "C:\Users\Sahay\anaconda3\envs\sleap\lib\site-packages\sleap\nn\inference.py", line 3266, in _make_labeled_frames_from_generator
for ex in generator:
File "C:\Users\Sahay\anaconda3\envs\sleap\lib\site-packages\sleap\nn\inference.py", line 455, in _predict_generator
for ex in self.pipeline.make_dataset():
File "C:\Users\Sahay\anaconda3\envs\sleap\lib\site-packages\tensorflow\python\data\ops\iterator_ops.py", line 800, in next
return self._next_internal()
File "C:\Users\Sahay\anaconda3\envs\sleap\lib\site-packages\tensorflow\python\data\ops\iterator_ops.py", line 786, in _next_internal
output_shapes=self._flat_output_shapes)
File "C:\Users\Sahay\anaconda3\envs\sleap\lib\site-packages\tensorflow\python\ops\gen_dataset_ops.py", line 2844, in iterator_get_next
_ops.raise_from_not_ok_status(e, name)
File "C:\Users\Sahay\anaconda3\envs\sleap\lib\site-packages\tensorflow\python\framework\ops.py", line 7107, in raise_from_not_ok_status
raise core._status_to_exception(e) from None # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.UnknownError: KeyError: "Unable to load frame 26621 from MediaVideo(filename='D:/nate/SLEAP retroorbital injected/Split videos/Cage_1_part1.avi', grayscale=True, bgr=True, dataset='', input_format='')."
Traceback (most recent call last):

File "C:\Users\Sahay\anaconda3\envs\sleap\lib\site-packages\tensorflow\python\ops\script_ops.py", line 273, in call
return func(device, token, args)

File "C:\Users\Sahay\anaconda3\envs\sleap\lib\site-packages\tensorflow\python\ops\script_ops.py", line 151, in call
outputs = self._call(device, args)

File "C:\Users\Sahay\anaconda3\envs\sleap\lib\site-packages\tensorflow\python\ops\script_ops.py", line 158, in _call
ret = self._func(*args)

File "C:\Users\Sahay\anaconda3\envs\sleap\lib\site-packages\tensorflow\python\autograph\impl\api.py", line 649, in wrapper
return func(*args, **kwargs)

File "C:\Users\Sahay\anaconda3\envs\sleap\lib\site-packages\sleap\nn\data\providers.py", line 405, in py_fetch_frame
raw_image = self.video.get_frame(frame_ind)

File "C:\Users\Sahay\anaconda3\envs\sleap\lib\site-packages\sleap\io\video.py", line 1104, in get_frame
return self.backend.get_frame(idx)

File "C:\Users\Sahay\anaconda3\envs\sleap\lib\site-packages\sleap\io\video.py", line 496, in get_frame
raise KeyError(f"Unable to load frame {idx} from {self}.")

KeyError: "Unable to load frame 26621 from MediaVideo(filename='D:/nate/SLEAP retroorbital injected/Split videos/Cage_1_part1.avi', grayscale=True, bgr=True, dataset='', input_format='')."

     [[{{node EagerPyFunc}}]] [Op:IteratorGetNext]

Process return code: 1

@eberrigan
Contributor

Hi @ngreen123,

Can you provide the command you ran to get this error? I can't tell what your intended goal was.

It seems like you have two issues.

The first is an out-of-memory issue. Despite having 20 GB of GPU memory, it looks like you need ~34 GB to train with your given hyperparameters. Can you provide these hyperparameters (the contents of the training config file, or the model)?
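For reference, the size in the log lines can be converted by hand. A minimal sketch (the variable names are only for illustration) showing that 34359738368 bytes is exactly 32 GiB, or about 34.4 GB in decimal units. Note that the log text itself describes the failed allocation as pinned memory "on host", and the comparison at the end uses the 19698 MiB of free GPU memory that SLEAP reports later in this thread:

```python
# Size reported in the TensorFlow log lines.
requested_bytes = 34359738368

gib = requested_bytes / 2**30   # binary gibibytes
gb = requested_bytes / 10**9    # decimal gigabytes

print(f"{gib:.0f} GiB")  # 32 GiB
print(f"{gb:.2f} GB")    # 34.36 GB

# Free GPU memory reported by SLEAP's auto-selection message, in MiB.
gpu_free_mib = 19698
print(requested_bytes > gpu_free_mib * 2**20)  # True: the request exceeds it
```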

The number of frames is not as important as the batch size or the image size, since we train and perform inference in batches. If you are training a model or running inference, you can decrease the batch size to reduce the amount of GPU memory used. When you are training, you can also decrease the input scale to lower the resolution of each frame.
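As a rough illustration of why batch size and image size matter more than frame count: the input tensor for one batch scales with batch × height × width × channels. The sketch below assumes float32 inputs (4 bytes per value) and a hypothetical 1024×1024 grayscale frame; neither figure comes from this thread, and real usage is dominated by model activations, so treat it only as a scaling argument.

```python
def batch_input_bytes(batch_size, height, width, channels=1, bytes_per_value=4):
    """Bytes needed for one input batch (inputs only, not model activations)."""
    return batch_size * height * width * channels * bytes_per_value

# Hypothetical 1024x1024 grayscale frames with a batch size of 4.
full = batch_input_bytes(batch_size=4, height=1024, width=1024)

# Halving the batch size halves the footprint.
half_batch = batch_input_bytes(batch_size=2, height=1024, width=1024)

# An input scale of 0.5 halves both height and width, so the footprint drops 4x.
half_scale = batch_input_bytes(batch_size=4, height=512, width=512)

print(full, half_batch, half_scale)
```

The total number of frames in the video never appears in this calculation, because only one batch is resident at a time.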

The second issue is that one of your frames cannot be loaded. This frame may be corrupted. If you can re-encode the video, or save the video in a different file format from the original frames, that could solve this issue.

KeyError: "Unable to load frame 26621 from MediaVideo(filename='D:/nate/SLEAP retroorbital injected/Split videos/Cage_1_part1.avi', grayscale=True, bgr=True, dataset='', input_format='')."
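One way to re-encode is with ffmpeg. A minimal sketch, assuming ffmpeg is installed and on PATH; the output filename is hypothetical, and the encoder settings (libx264, yuv420p, CRF 23) are a common choice rather than something specified in this thread:

```python
import shutil
import subprocess
from pathlib import Path

def build_reencode_cmd(src, dst, crf=23):
    """Return an ffmpeg argument list that re-encodes src to H.264 in dst."""
    return [
        "ffmpeg", "-y",           # overwrite dst if it already exists
        "-i", src,
        "-c:v", "libx264",
        "-pix_fmt", "yuv420p",
        "-preset", "superfast",
        "-crf", str(crf),
        dst,
    ]

src = "D:/nate/SLEAP retroorbital injected/Split videos/Cage_1_part1.avi"
dst = "D:/nate/SLEAP retroorbital injected/Split videos/Cage_1_part1.mp4"  # hypothetical
cmd = build_reencode_cmd(src, dst)

# Passing a list (not a single string) means paths with spaces need no quoting.
# Only run when ffmpeg and the source file are actually present.
if shutil.which("ffmpeg") and Path(src).exists():
    subprocess.run(cmd, check=True)
```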

Best,

Elizabeth

@eberrigan eberrigan self-assigned this Oct 22, 2024
@ngreen123
Author

Hi @eberrigan, thanks for getting back to us!

Here's our command line once I've initiated inference, and I've attached a screenshot of our parameters:
Using already trained model for multi_instance: D:/nate/sleap model/230811_235437.multi_instance.n=1019/training_config.json
Command line call:
sleap-track D:/nate/sleap model/labels.v001.slp --video.index 0 --frames 0,-26998 -m D:/nate/sleap model/230811_235437.multi_instance.n=1019/training_config.json --tracking.tracker flowmaxtracks --tracking.max_tracks 2 --tracking.similarity instance --tracking.match hungarian --tracking.track_window 4 --tracking.post_connect_single_breaks 0 --tracking.max_tracking 1 -o D:/nate/sleap model\predictions\labels.v001.slp.241022_164023.predictions.slp --verbosity json --no-empty-frames

Started inference at: 2024-10-22 16:40:34.051479
Args:
{
'data_path': 'D:/nate/sleap model/labels.v001.slp',
'models': [
'D:/nate/sleap model/230811_235437.multi_instance.n=1019/training_config.json'
],
'frames': '0,-26998',
'only_labeled_frames': False,
'only_suggested_frames': False,
'output': 'D:/nate/sleap model\predictions\labels.v001.slp.241022_164023.predictions.slp',
'no_empty_frames': True,
'verbosity': 'json',
'video.dataset': None,
'video.input_format': 'channels_last',
'video.index': '0',
'cpu': False,
'first_gpu': False,
'last_gpu': False,
'gpu': 'auto',
'max_edge_length_ratio': 0.25,
'dist_penalty_weight': 1.0,
'batch_size': 4,
'open_in_gui': False,
'peak_threshold': 0.2,
'max_instances': None,
'tracking.tracker': 'flowmaxtracks',
'tracking.max_tracking': True,
'tracking.max_tracks': 2,
'tracking.target_instance_count': None,
'tracking.pre_cull_to_target': None,
'tracking.pre_cull_iou_threshold': None,
'tracking.post_connect_single_breaks': 0,
'tracking.clean_instance_count': None,
2024-10-22 16:40:35.896423: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
'tracking.clean_iou_threshold': None,
'tracking.similarity': 'instance',
'tracking.match': 'hungarian',
'tracking.robust': None,
'tracking.track_window': 4,
'tracking.min_new_track_points': None,
'tracking.min_match_points': None,
'tracking.img_scale': None,
'tracking.of_window_size': None,
'tracking.of_max_levels': None,
'tracking.save_shifted_instances': None,
'tracking.kf_node_indices': None,
'tracking.kf_init_frame_count': None
}
2024-10-22 16:40:36.743690: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 17594 MB memory: -> device: 0, name: NVIDIA RTX 4000 Ada Generation, pci bus id: 0000:55:00.0, compute capability: 8.9

INFO:sleap.nn.inference:Auto-selected GPU 0 with 19698 MiB of free memory.
2024-10-22 16:40:46.086805: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8201
2024-10-22 16:40:48.210046: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] INTERNAL: ptxas exited with non-zero error code -1, output:
Relying on driver to perform ptx compilation.
Modify $PATH to customize ptxas location.
This message will be only logged once.
Versions:
SLEAP: 1.3.3
TensorFlow: 2.7.0
Numpy: 1.21.6
Python: 3.7.12
OS: Windows-10-10.0.22621-SP0

System:
GPUs: 1/1 available
Device: /physical_device:GPU:0
Available: True
Initalized: False
Memory growth: True

#######
model:

[Image: screenshot of the model parameters]

@eberrigan
Contributor

Hi @ngreen123,

  1. Did you train your model on the same machine? Inference should be less GPU-intensive than training.

  2. Are you using quotation marks around the paths in your command line arguments? I am a little surprised those are working without quotation marks around the path strings.

  3. Let's try inference without tracking to narrow down the problem.

  4. Then please try decreasing the batch size.
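Steps 2 to 4 can be sketched together. This is only an illustration, assuming `sleap-track` is launched from Python rather than from the GUI: it reuses the paths and flags from the command posted above, drops the `--tracking.*` flags, and passes `--batch_size 2` (half of the 4 shown in the Args dump; the value 2 and the output filename are arbitrary choices). Passing the arguments as a list keeps paths with spaces intact without manual quoting.

```python
import subprocess

model_dir = "D:/nate/sleap model"

cmd = [
    "sleap-track",
    f"{model_dir}/labels.v001.slp",
    "--video.index", "0",
    "--frames", "0,-26998",
    "-m", f"{model_dir}/230811_235437.multi_instance.n=1019/training_config.json",
    "--batch_size", "2",  # assumption: half of the 4 shown in the Args dump
    "-o", f"{model_dir}/predictions/labels.v001.no_tracking.predictions.slp",  # hypothetical name
]

# How the command would look if typed into a Windows terminal: arguments
# containing spaces are wrapped in double quotes.
printable = subprocess.list2cmdline(cmd)
print(printable)

# To actually run it (requires the SLEAP environment to be active):
# subprocess.run(cmd, check=True)
```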

Thanks!

Elizabeth

@ngreen123
Author

Hi,

  1. We trained it on a different machine; however, we were getting the same problem as before. That machine had significantly less memory.

  2. We are running from the GUI and not typing anything into the terminal itself, so we're not sure whether we have control over the quotation marks.

  3. I tried it without tracking on an mp4 and that seemed to fix the memory problem! Now I'll just have to figure out how to track post-inference.

  4. I'll try this next. We're a little worried that this may reduce the accuracy of the model, but it's worth a try.

Thanks again!

@eberrigan
Contributor

Please take a look at the examples here https://sleap.ai/guides/cli.html#sleap-track. You can run tracking without inference if the predictions file is specified and no models are specified.
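A sketch of what that could look like, assuming the predictions file written by an inference-only run is passed as the input and no `-m` model is given. The tracking flags are copied from the original command earlier in this thread; both file names are hypothetical.

```python
import subprocess

def build_retrack_cmd(predictions_path, output_path, track_window=4):
    """Build a sleap-track call that only runs tracking (no -m model given)."""
    return [
        "sleap-track",
        predictions_path,
        "--tracking.tracker", "flowmaxtracks",
        "--tracking.max_tracking", "1",
        "--tracking.max_tracks", "2",
        "--tracking.similarity", "instance",
        "--tracking.match", "hungarian",
        "--tracking.track_window", str(track_window),
        "-o", output_path,
    ]

cmd = build_retrack_cmd(
    "D:/nate/sleap model/predictions/untracked.predictions.slp",  # hypothetical
    "D:/nate/sleap model/predictions/retracked.predictions.slp",  # hypothetical
)
print(subprocess.list2cmdline(cmd))

# To actually run it (requires the SLEAP environment to be active):
# subprocess.run(cmd, check=True)
```

The `track_window` parameter is exposed so that a smaller window can be tried without rewriting the rest of the command.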

So does the inference with tracking complete when using an mp4?

@ngreen123
Author

Great! Thank you

I tried it out with tracking. Using the same tracking criteria as before, I was still getting memory issues; however, when I bumped the elapsed frame window down to 2, I got almost no memory warnings or errors!

@eberrigan
Contributor

Yay! Please let us know if you have any more issues. I will mark this issue as done.
