Package import fails when using tensorflow-gpu and CUDA libraries are missing #532

Closed
guillaumekln opened this issue Sep 20, 2019 · 8 comments · Fixed by #539
Labels
bug (Something isn't working), build

Comments

@guillaumekln
Contributor

guillaumekln commented Sep 20, 2019

System information

  • OS Platform and Distribution: Ubuntu 16.04
  • TensorFlow version and how it was installed: 2.0.0rc1 (binary)
  • TensorFlow-Addons version and how it was installed: 0.5.1 (binary)
  • Python version: 3.5.2
  • Is GPU used? No

Describe the bug

When using tensorflow-gpu, importing the tensorflow_addons package fails if the CUDA libraries are missing, even though importing tensorflow itself works without issue.

Code to reproduce the issue

On a non-GPU system:

$ pip install tensorflow-addons==0.5.1
$ python
>>> import tensorflow_addons as tfa

Other info / logs

2019-09-20 18:04:16.387041: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory
Segmentation fault

@guillaumekln added the bug (Something isn't working) and build labels on Sep 20, 2019
@guillaumekln changed the title from "0.5.1 package fails on import when CUDA libraries are missing" to "Package import fails when using tensorflow-gpu and CUDA libraries are missing" on Sep 20, 2019
@seanpmorgan
Member

seanpmorgan commented Sep 20, 2019

Thanks for catching this. Just for clarity: does import tensorflow with tensorflow-gpu installed work on the CPU-only system?

@guillaumekln
Contributor Author

Yes.

@seanpmorgan
Member

Thanks. I was also able to confirm this using the custom-op CPU-only Docker image.

cc @gunan @yifeif Do either of you have any insight on why our dynamic loading (when paired with tensorflow-gpu==2.0.0rc1) would require CUDA?

I tested the same release, 0.5.1, with just tensorflow==2.0.0rc1 and it properly loaded the CPU kernels only.
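
For reference, the working comparison looks roughly like this (versions as reported above, output elided):

$ pip install tensorflow==2.0.0rc1 tensorflow-addons==0.5.1
$ python
>>> import tensorflow_addons as tfa  # imports cleanly, CPU kernels only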

@seanpmorgan
Member

seanpmorgan commented Sep 20, 2019

Hmmm, just as a note: when running tf.test.is_gpu_available() with tensorflow-gpu I get the same warning:

tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory

But it doesn't segfault right afterward and eventually returns False.

EDIT: Not the same warning. Our package was looking for libcudart instead of libcuda.
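
For reference, the CPU-only comparison looks roughly like this (approximate transcript, timestamp omitted):

>>> import tensorflow as tf
>>> tf.test.is_gpu_available()
... W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
False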

@seanpmorgan
Member

Full trace using faulthandler. Not much info other than that it happens in load_library:

Fatal Python error: Segmentation fault

Current thread 0x00007ff858e27700 (most recent call first):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/load_library.py", line 61 in load_op_library
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_addons/activations/gelu.py", line 25 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_addons/activations/__init__.py", line 21 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1023 in _handle_fromlist
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_addons/__init__.py", line 21 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  File "<frozen importlib._bootstrap>", line 665 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "<stdin>", line 1 in <module>
Segmentation fault (core dumped)
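
For context, line 25 of gelu.py is a module-level kernel load along these lines (a rough sketch; the exact helper and relative path may differ):

import tensorflow as tf
from tensorflow.python.platform import resource_loader

# The shared object is loaded at import time, so a failing load of libcudart
# takes down the whole `import tensorflow_addons`.
_activation_ops_so = tf.load_op_library(
    resource_loader.get_path_to_datafile("custom_ops/activations/_activation_ops.so"))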

@gunan

gunan commented Sep 20, 2019

One scenario I see happening is the following:
TF manages to load GPU code on CUDA-less machines by not explicitly dynamically linking the CUDA libraries. Is it possible that the tf addons package explicitly links CUDA?
To confirm this:

$ python -m pip install tensorflow-addons==0.5.1 --user
$ for f in `find .local/lib/python2.7/site-packages/tensorflow_addons  -name "*.so"`; do echo "**********************  $f"; ldd $f | grep cuda; done
**********************  .local/lib/python2.7/site-packages/tensorflow_addons/custom_ops/activations/_activation_ops.so
	libcudart-1581fefa.so.10.0.130 => /usr/local/google/home/gunan/.local/lib/python2.7/site-packages/tensorflow_addons/custom_ops/activations/../../.libs/libcudart-1581fefa.so.10.0.130 (0x00007f149edc2000)
**********************  .local/lib/python2.7/site-packages/tensorflow_addons/custom_ops/layers/_correlation_cost_ops.so
	libcudart-1581fefa.so.10.0.130 => /usr/local/google/home/gunan/.local/lib/python2.7/site-packages/tensorflow_addons/custom_ops/layers/../../.libs/libcudart-1581fefa.so.10.0.130 (0x00007f7046e0b000)
**********************  .local/lib/python2.7/site-packages/tensorflow_addons/custom_ops/image/_image_ops.so
	libcudart-1581fefa.so.10.0.130 => /usr/local/google/home/gunan/.local/lib/python2.7/site-packages/tensorflow_addons/custom_ops/image/../../.libs/libcudart-1581fefa.so.10.0.130 (0x00007f66e3069000)
**********************  .local/lib/python2.7/site-packages/tensorflow_addons/custom_ops/image/_distort_image_ops.so
	libcudart-1581fefa.so.10.0.130 => /usr/local/google/home/gunan/.local/lib/python2.7/site-packages/tensorflow_addons/custom_ops/image/../../.libs/libcudart-1581fefa.so.10.0.130 (0x00007f9a57252000)
**********************  .local/lib/python2.7/site-packages/tensorflow_addons/custom_ops/text/_skip_gram_ops.so
**********************  .local/lib/python2.7/site-packages/tensorflow_addons/custom_ops/seq2seq/_beam_search_ops.so
	libcudart-1581fefa.so.10.0.130 => /usr/local/google/home/gunan/.local/lib/python2.7/site-packages/tensorflow_addons/custom_ops/seq2seq/../../.libs/libcudart-1581fefa.so.10.0.130 (0x00007fe4b5039000)

Yifei, do we have our wrappers for CUDA code exposed? Is there a way to rewrite tfa to use those and avoid directly linking CUDA?
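
To illustrate the distinction (a generic Python sketch, not TF's actual loader code): a hard link-time dependency on libcudart is resolved by the dynamic linker as soon as the .so is loaded, whereas a guarded runtime load can degrade gracefully on a CUDA-less machine:

import ctypes

# Generic illustration only: a guarded runtime load can fall back when the CUDA
# runtime is absent, while a link-time dependency fails before any Python code runs.
try:
    cudart = ctypes.CDLL("libcudart.so.10.0")
except OSError:
    cudart = None  # CUDA runtime missing; continue CPU-only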

@facaiy
Member

facaiy commented Sep 20, 2019

I ran into the same problem when using tensorflow-addons==0.5.0 with tensorflow-gpu==2.0.0-rc0 in a non-GPU Linux environment.

@yifeif

yifeif commented Sep 20, 2019

I think we can link against libtensorflow_framework.so instead of NVIDIA's shared library, as long as the symbols used by the custom ops are wrapped in TF, but I haven't tried it yet. The toolchain needs to be updated to point to libtensorflow_framework.so.
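
A hypothetical smoke test for such a change (assuming a rebuilt wheel is installed on a CUDA-less machine; this is not part of the current release):

import tensorflow as tf
import tensorflow_addons as tfa

# The import should succeed, and the CPU kernel of the custom op should run.
print(tfa.activations.gelu(tf.constant([-1.0, 0.0, 1.0])))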
