Compiling the TF plugin (using docker) fails #2213
I've installed nvidia-docker, and the build process now gets further but still ends with an error:
+ nvidia-docker run --name extract_dali_tf_prebuilt_manylinux1 nvidia/dali:cu100.build_tf_manylinux1 /bin/bash -c 'source /opt/dali/dali_tf_plugin/build_in_custom_op_docker.sh'
++ set -e
++ PYTHON_DIST_PACKAGES=($(python -c "import site; print(site.getsitepackages()[0])"))
+++ python -c 'import site; print(site.getsitepackages()[0])'
++ DALI_TOPDIR=/usr/local/lib/python2.7/dist-packages/nvidia/dali
+++ cat /usr/local/cuda/version.txt
+++ head -1
+++ sed 's/.*Version \([0-9]\+\)\.\([0-9]\+\).*/\1\2/'
++ CUDA_VERSION=100
+++ python ../qa/setup_packages.py -n -u tensorflow-gpu --cuda 100
Traceback (most recent call last):
File "../qa/setup_packages.py", line 6, in <module>
import urllib.parse
ImportError: No module named parse
++ LAST_CONFIG_INDEX= |
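For readers following the trace, the `sed` expression above reduces the first line of `version.txt` to the major and minor digits (`CUDA_VERSION=100`). A minimal Python sketch of the same transformation follows; the function name and the sample line are assumptions made for illustration, not part of the DALI scripts.

```python
import re


def compact_cuda_version(version_line):
    """Keep only the major and minor digits, like the sed expression in the trace."""
    return re.sub(r".*Version ([0-9]+)\.([0-9]+).*", r"\1\2", version_line)


# Assumed sample of the first line of /usr/local/cuda/version.txt.
sample_line = "CUDA Version 10.0.130"
print(compact_cuda_version(sample_line))
```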
Hi @kindoblue, As for the second error, it looks like the script does not propagate the Python version properly and uses Python 2.7 for the TF Plugin containers. I will try to post a fix soon, will get back to you when I have a PR. Thanks for reporting that. |
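The `ImportError: No module named parse` above is what Python 2.7 reports for `import urllib.parse`, since that module only exists in Python 3. The fix discussed in this thread is to run the script under the right Python version. Purely as an illustration of the difference, a script that had to run on both interpreters could guard the import as sketched below; the URL is a made-up example.

```python
try:
    from urllib.parse import urlparse  # Python 3 location
except ImportError:
    from urlparse import urlparse  # Python 2 location

# Hypothetical URL, used only to show that the imported function works.
parts = urlparse("https://example.com/wheels/index.html")
print(parts.netloc)
```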
I adjusted the scripts and docs a bit in #2214. On my machine it successfully built both the wheel and the TF plugin. It is worth mentioning that we recently started building one wheel that is compatible with several minor Python versions.
|
Hi, the PR has been merged. Can you check whether it helps, so we can close the issue? |
I'm on vacation now. I will be able to test the fix next week. Thanks. |
I tried again. The build process fails while building the plugin, with the error:
+ export DALI_TF_BUILDER_CONTAINER_MANYLINUX2010=extract_dali_tf_prebuilt_manylinux2010
+ DALI_TF_BUILDER_CONTAINER_MANYLINUX2010=extract_dali_tf_prebuilt_manylinux2010
+ docker run --gpus all --name extract_dali_tf_prebuilt_manylinux2010 nvidia/dali:cu100.build_tf_manylinux2010 /bin/bash -c 'source /opt/dali/dali_tf_plugin/build_in_custom_op_docker.sh'
++ set -e
++ PYTHON_DIST_PACKAGES=($(python -c "import site; print(site.getsitepackages()[0])"))
+++ python -c 'import site; print(site.getsitepackages()[0])'
++ DALI_TOPDIR=/usr/local/lib/python3.6/dist-packages/nvidia/dali
+++ cat /usr/local/cuda/version.txt
+++ head -1
+++ sed 's/.*Version \([0-9]\+\)\.\([0-9]\+\).*/\1\2/'
++ CUDA_VERSION=100
+++ python ../qa/setup_packages.py -n -u tensorflow-gpu --cuda 100
Traceback (most recent call last):
File "../qa/setup_packages.py", line 409, in <module>
main()
File "../qa/setup_packages.py", line 400, in main
print (cal_num_of_configs(args.use, args.cuda) - 1)
File "../qa/setup_packages.py", line 365, in cal_num_of_configs
ret *= pckg.get_num_of_version(cuda_version)
File "../qa/setup_packages.py", line 140, in get_num_of_version
return len(self.get_all_versions(cuda_version))
File "../qa/setup_packages.py", line 218, in get_all_versions
return self.filter_versions(self.versions[cuda_version])
File "../qa/setup_packages.py", line 106, in filter_versions
return [str(v) for v in versions if v]
File "../qa/setup_packages.py", line 106, in <listcomp>
return [str(v) for v in versions if v]
File "../qa/setup_packages.py", line 46, in __bool__
(not self.python_max_ver or parse(PYTHON_VERSION) <= parse(self.python_max_ver))
TypeError: 'module' object is not callable
++ LAST_CONFIG_INDEX=
I tried to edit the file
PS: I used the original command line BUILD_TF_PLUGIN=YES PYVER=3.7 CUDA_VERSION=10.0 ./build.sh because only now I realize that
======================================
but then I get another error:
Writing nvidia-dali-tf-plugin-cuda100-0.26.0.dev0/setup.cfg
creating dist
Creating tar archive
removing 'nvidia-dali-tf-plugin-cuda100-0.26.0.dev0' (and everything under it)
++ cp dist/nvidia-dali-tf-plugin-cuda100-0.26.0.dev0.tar.gz /dali_tf_sdist
/opt/dali/dali_tf_plugin
++ popd
+ docker cp extract_dali_tf_sdist:/dali_tf_sdist/. dali_tf_sdist
+ cp dali_tf_sdist/nvidia-dali-tf-plugin-cuda100-0.26.0.dev0.tar.gz wheelhouse/
+ cp 'dali_tf_sdist/dummy/*.tar.gz' wheelhouse/dummy
cp: cannot stat 'dali_tf_sdist/dummy/*.tar.gz': No such file or directory
+ true
+ docker rm -f extract_dali_tf_sdist
extract_dali_tf_sdist
+ rm -rf dali_tf_plugin/whl
+ rm -rf dali_tf_sdist/
+ '[' NO == YES ']'
+ popd |
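The `TypeError: 'module' object is not callable` in the earlier traceback means the name `parse` was bound to a module when it was called. The thread does not show how `setup_packages.py` imports that name, so the sketch below only reproduces the general failure mode and shows one way to compare version strings numerically; `version_tuple` is a hypothetical helper, not a function from the DALI repository.

```python
from urllib import parse  # binds the name `parse` to a module, not a function


def version_tuple(text):
    """Hypothetical helper: turn '3.6' into (3, 6) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


try:
    parse("3.6")  # calling a module raises TypeError
except TypeError as err:
    message = str(err)
    print(message)

# Tuples compare element by element, so 3.6 sorts before 3.10.
print(version_tuple("3.6") <= version_tuple("3.10"))
```

Comparing the raw strings would give the wrong ordering here, because `"3.6" <= "3.10"` is evaluated character by character.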
Hmm, the source should be mounted into the Docker container; I'm not sure what is going on here. I will check this on Monday if the issue persists. |
There is an additional step in build.sh that prepares the plugin builder image. Please relaunch with |
First of all, I pruned all the Docker stuff on my system with the command:
Then I issued the command in the DALI/docker directory:
Almost immediately the build script fails with an error similar to this one:
This is probably due to my system (Ubuntu 20.04), but anyway I modified all the calls (in
After compiling half the world, I now have in
➜ wheelhouse git:(master) ✗ ls -ltr
total 261760
-rw-r--r-- 1 ice ice 267728670 aug 29 08:45 nvidia_dali_cuda100-0.26.0.dev0-12345-py3-none-manylinux2014_x86_64.whl
-rw-r--r-- 1 ice ice 306643 aug 29 09:41 nvidia-dali-tf-plugin-cuda100-0.26.0.dev0.tar.gz
I don't see any whl for the DALI TensorFlow plugin, just a tar.gz. Is it supposed to be like this? Note that the script ends with the following output:
Creating tar archive
removing 'nvidia-dali-tf-plugin-cuda100-0.26.0.dev0' (and everything under it)
++ cp dist/nvidia-dali-tf-plugin-cuda100-0.26.0.dev0.tar.gz /dali_tf_sdist
/opt/dali/dali_tf_plugin
++ popd
+ docker cp extract_dali_tf_sdist:/dali_tf_sdist/. dali_tf_sdist
+ cp dali_tf_sdist/nvidia-dali-tf-plugin-cuda100-0.26.0.dev0.tar.gz wheelhouse/
+ cp 'dali_tf_sdist/dummy/*.tar.gz' wheelhouse/dummy
cp: cannot stat 'dali_tf_sdist/dummy/*.tar.gz': No such file or directory
+ true
+ docker rm -f extract_dali_tf_sdist
extract_dali_tf_sdist
+ rm -rf dali_tf_plugin/whl
+ rm -rf dali_tf_sdist/
+ '[' NO == YES ']'
+ popd |
Yes, the TensorFlow plugin is distributed as a source distribution, hence the .tar.gz. If you kept the During installation, it checks whether the prebuilt libraries are compatible with the TensorFlow distribution you are using and installs them. If they are not compatible (for example, your TensorFlow was built on your machine with a different compiler than expected), it will attempt to query TensorFlow for its configuration and build the plugin libraries during installation. If that fails, you will be notified of what didn't match in the configuration. |
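To see what TensorFlow reports about its own build configuration, the public `tf.sysconfig` helpers can be queried directly. This is a minimal sketch and not the plugin installer's actual code; the function name `query_tf_build_config` is an assumption, and it returns `None` when TensorFlow is not installed.

```python
def query_tf_build_config():
    """Return TensorFlow's reported build settings, or None if TensorFlow is absent."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return {
        "version": tf.__version__,
        # Flags TensorFlow suggests for compiling custom ops against it.
        "compile_flags": tf.sysconfig.get_compile_flags(),
        # Flags TensorFlow suggests for linking custom ops against it.
        "link_flags": tf.sysconfig.get_link_flags(),
    }


config = query_tf_build_config()
if config is None:
    print("TensorFlow is not installed; nothing to query.")
else:
    for key, value in config.items():
        print(key, value)
```

Comparing this output between the build container and the target machine may help explain why an installation decides to rebuild the plugin libraries.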
I've managed to install the TF plugin with this command (setting CFLAGS because it wanted to compile the thing):
CFLAGS="-I$CUDA_HOME/include $CFLAGS" pip install nvidia-dali-tf-plugin-cuda100-0.26.0.dev0.tar.gz
Well, it was not smooth, but I finally managed to get DALI and the TF plugin compiled. Thanks for the help. |
Hi, |
DALI 0.26 is available and should include the needed functionality. |
I'm trying to compile DALI with the following command, in the docker directory:
BUILD_TF_PLUGIN=YES PYVER=3.7 CUDA_VERSION=10.0 ./build.sh
DALI got compiled and the wheel was generated, but then the script starts to build the TF plugin and I get the following error.
I don't recall reading in the documentation about installing nvidia-docker. Is it really needed for building a plugin within a docker image?
On Ubuntu 20.04. Using docker script.