Compiling the TF plugin (using docker) fails #2213

Closed
kindoblue opened this issue Aug 15, 2020 · 13 comments
Labels
bug Something isn't working

Comments

@kindoblue
Contributor

kindoblue commented Aug 15, 2020

I'm trying to compile DALI with the following command, run from the docker directory:

BUILD_TF_PLUGIN=YES PYVER=3.7 CUDA_VERSION=10.0 ./build.sh

DALI compiled and the wheel was generated, but when the script starts building the TF plugin I get the following error:

+ nvidia-docker run --name extract_dali_tf_prebuilt_manylinux1 nvidia/dali:cu100.build_tf_manylinux1 /bin/bash -c 'source /opt/dali/dali_tf_plugin/build_in_custom_op_docker.sh'
./build.sh: line 261: nvidia-docker: command not found

I don't recall reading in the documentation about installing nvidia-docker. Is it really needed for building a plugin within a docker image?

I'm on Ubuntu 20.04, using the docker build script.

@kindoblue
Contributor Author

I've installed nvidia-docker and the build process now gets further, but it still ends with an error:

+ nvidia-docker run --name extract_dali_tf_prebuilt_manylinux1 nvidia/dali:cu100.build_tf_manylinux1 /bin/bash -c 'source /opt/dali/dali_tf_plugin/build_in_custom_op_docker.sh'
++ set -e
++ PYTHON_DIST_PACKAGES=($(python -c "import site; print(site.getsitepackages()[0])"))
+++ python -c 'import site; print(site.getsitepackages()[0])'
++ DALI_TOPDIR=/usr/local/lib/python2.7/dist-packages/nvidia/dali
+++ cat /usr/local/cuda/version.txt
+++ head -1
+++ sed 's/.*Version \([0-9]\+\)\.\([0-9]\+\).*/\1\2/'
++ CUDA_VERSION=100
+++ python ../qa/setup_packages.py -n -u tensorflow-gpu --cuda 100
Traceback (most recent call last):
  File "../qa/setup_packages.py", line 6, in <module>
    import urllib.parse
ImportError: No module named parse
++ LAST_CONFIG_INDEX=
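
For readers following the trace, the `sed` step turns the first line of /usr/local/cuda/version.txt into the compact code `100`. A rough Python equivalent is sketched below. The function name is hypothetical, and the sample line `CUDA Version 10.0.130` is an assumption about what that file contains for CUDA 10.0. Unlike `sed`, which echoes a non-matching line unchanged, this sketch returns `None` when the pattern is absent.

```python
import re


def cuda_version_code(first_line):
    """Extract major and minor from a 'Version X.Y' string and join them.

    Hypothetical helper for illustration; it is not part of DALI's scripts.
    """
    match = re.search(r"Version (\d+)\.(\d+)", first_line)
    if match is None:
        return None
    return match.group(1) + match.group(2)


# Assumed sample content of version.txt for CUDA 10.0.
print(cuda_version_code("CUDA Version 10.0.130"))
```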

@klecki klecki added the bug Something isn't working label Aug 17, 2020
@klecki
Contributor

klecki commented Aug 17, 2020

Hi @kindoblue,
sorry for the confusion about nvidia-docker. I will update the documentation and the script so that it supports the new syntax from the NVIDIA Container Toolkit (docker run --gpus all).
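
As an illustration of the two invocation styles mentioned here, the sketch below picks a command prefix depending on which executable is on the PATH. It is not how build.sh does it: the helper name and the preference order (legacy wrapper first) are assumptions, and the lookup function is injectable so the logic can be tried without docker installed.

```python
import shutil


def gpu_docker_run_prefix(which=shutil.which):
    """Return the argv prefix for starting a GPU-enabled container.

    Hypothetical helper, not part of DALI's build.sh.
    """
    if which("nvidia-docker"):
        # Legacy wrapper shipped with the nvidia-docker package.
        return ["nvidia-docker", "run"]
    if which("docker"):
        # NVIDIA Container Toolkit syntax.
        return ["docker", "run", "--gpus", "all"]
    raise RuntimeError("neither nvidia-docker nor docker found on PATH")


# Example with a fake lookup that only knows about plain docker.
def fake_which(name):
    return "/usr/bin/docker" if name == "docker" else None


print(gpu_docker_run_prefix(fake_which))
```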

As for the second error, it looks like the script does not propagate the Python version properly and uses Python 2.7 for the TF plugin containers. I will try to post a fix soon and will get back to you when I have a PR.
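
The `ImportError: No module named parse` in the trace is what Python 2.7 reports for `import urllib.parse`, since that submodule only exists in Python 3. A script can make this failure easier to read with an explicit version guard. The sketch below is only an illustration with a hypothetical function name, not the fix that went into DALI.

```python
import sys


def check_python3(version_info=sys.version_info):
    """Raise a readable error when the interpreter is older than Python 3."""
    if version_info[0] < 3:
        raise RuntimeError(
            "Python 3 is required, found %d.%d" % (version_info[0], version_info[1])
        )
    return True


check_python3()
# Safe to import only after the guard: Python 2 has urllib but no urllib.parse.
import urllib.parse
```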

Thanks for reporting that.

@klecki
Contributor

klecki commented Aug 17, 2020

I adjusted the scripts and docs a bit in #2214.

On my machine it successfully built both the wheel and the TF plugin.

It's worth mentioning that we recently started building a single wheel that is compatible with several minor Python versions.
It should be enough to run the script as follows, omitting PYVER:

BUILD_TF_PLUGIN=YES CUDA_VERSION=10.0 ./build.sh

@klecki klecki added this to the Release_0.26.0 milestone Aug 19, 2020
@klecki
Contributor

klecki commented Aug 24, 2020

Hi, the PR has been merged. Can you check whether it helps, so we can close the issue?

@kindoblue
Contributor Author

I'm on vacation now. I'll be able to test the fix next week. Thanks.

@kindoblue
Contributor Author

kindoblue commented Aug 28, 2020

I tried again. The build process fails trying to build the plugin, with the error:

+ export DALI_TF_BUILDER_CONTAINER_MANYLINUX2010=extract_dali_tf_prebuilt_manylinux2010
+ DALI_TF_BUILDER_CONTAINER_MANYLINUX2010=extract_dali_tf_prebuilt_manylinux2010
+ docker run --gpus all --name extract_dali_tf_prebuilt_manylinux2010 nvidia/dali:cu100.build_tf_manylinux2010 /bin/bash -c 'source /opt/dali/dali_tf_plugin/build_in_custom_op_docker.sh'
++ set -e
++ PYTHON_DIST_PACKAGES=($(python -c "import site; print(site.getsitepackages()[0])"))
+++ python -c 'import site; print(site.getsitepackages()[0])'
++ DALI_TOPDIR=/usr/local/lib/python3.6/dist-packages/nvidia/dali
+++ cat /usr/local/cuda/version.txt
+++ head -1
+++ sed 's/.*Version \([0-9]\+\)\.\([0-9]\+\).*/\1\2/'
++ CUDA_VERSION=100
+++ python ../qa/setup_packages.py -n -u tensorflow-gpu --cuda 100
Traceback (most recent call last):
  File "../qa/setup_packages.py", line 409, in <module>
    main()
  File "../qa/setup_packages.py", line 400, in main
    print (cal_num_of_configs(args.use, args.cuda) - 1)
  File "../qa/setup_packages.py", line 365, in cal_num_of_configs
    ret *= pckg.get_num_of_version(cuda_version)
  File "../qa/setup_packages.py", line 140, in get_num_of_version
    return len(self.get_all_versions(cuda_version))
  File "../qa/setup_packages.py", line 218, in get_all_versions
    return self.filter_versions(self.versions[cuda_version])
  File "../qa/setup_packages.py", line 106, in filter_versions
    return [str(v) for v in versions if v]
  File "../qa/setup_packages.py", line 106, in <listcomp>
    return [str(v) for v in versions if v]
  File "../qa/setup_packages.py", line 46, in __bool__
    (not self.python_max_ver or parse(PYTHON_VERSION) <= parse(self.python_max_ver))
TypeError: 'module' object is not callable
++ LAST_CONFIG_INDEX=

I tried to edit the file ../qa/setup_packages.py, but apparently my changes are not picked up (and the line numbers don't match), so perhaps the file is taken from a docker image?
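
For context on the `TypeError` itself: Python raises it whenever a name that is being called refers to a module rather than a function. The sketch below reproduces the same class of error with the standard library only. Whether this is what happened inside setup_packages.py is an assumption, since the thread never shows that file's imports. A script that wants a callable `parse` for version strings would need to bind a function to that name, for example `parse` from `packaging.version` (again an assumption about the intent).

```python
# Here "parse" names the urllib.parse module, not a function.
import urllib.parse as parse

try:
    # Calling a module object is not allowed and raises TypeError.
    parse("3.6")
except TypeError as err:
    message = str(err)
    print(message)
```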

PS: I used the original command line

BUILD_TF_PLUGIN=YES PYVER=3.7 CUDA_VERSION=10.0 ./build.sh

because I only now realized that PYVER=3.7 can be omitted.

======================================
PPS: I even tried the following command to work around the problem:

BUILD_TF_PLUGIN=YES PREBUILD_TF_PLUGINS=NO CUDA_VERSION=10.0 ./build.sh

but then I get another error:

Writing nvidia-dali-tf-plugin-cuda100-0.26.0.dev0/setup.cfg
creating dist
Creating tar archive
removing 'nvidia-dali-tf-plugin-cuda100-0.26.0.dev0' (and everything under it)
++ cp dist/nvidia-dali-tf-plugin-cuda100-0.26.0.dev0.tar.gz /dali_tf_sdist
/opt/dali/dali_tf_plugin
++ popd
+ docker cp extract_dali_tf_sdist:/dali_tf_sdist/. dali_tf_sdist
+ cp dali_tf_sdist/nvidia-dali-tf-plugin-cuda100-0.26.0.dev0.tar.gz wheelhouse/
+ cp 'dali_tf_sdist/dummy/*.tar.gz' wheelhouse/dummy
cp: cannot stat 'dali_tf_sdist/dummy/*.tar.gz': No such file or directory
+ true
+ docker rm -f extract_dali_tf_sdist
extract_dali_tf_sdist
+ rm -rf dali_tf_plugin/whl
+ rm -rf dali_tf_sdist/
+ '[' NO == YES ']'
+ popd
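
A note on the `cp: cannot stat` line in the log above: when a shell glob matches nothing, bash by default passes the pattern through literally, so `cp` is asked to copy a file that is really named `*.tar.gz`. The script appears to tolerate this, since the trace continues with `+ true`. The same "copy whatever matches, if anything" step can be sketched in Python as below. The helper name is hypothetical and this is not code from build.sh.

```python
import glob
import os
import shutil


def copy_matching(pattern, dest_dir):
    """Copy every file matching pattern into dest_dir.

    Returns the list of copied source paths. An empty match is not an error:
    nothing is copied and dest_dir is not created.
    """
    matches = sorted(glob.glob(pattern))
    if not matches:
        return []
    os.makedirs(dest_dir, exist_ok=True)
    for path in matches:
        shutil.copy(path, dest_dir)
    return matches
```

Called as `copy_matching("dali_tf_sdist/dummy/*.tar.gz", "wheelhouse/dummy")`, it would return an empty list in the situation shown in the log instead of reporting an error.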

@klecki
Contributor

klecki commented Aug 28, 2020

Hmm, the source should be mounted into the docker container; I'm not sure what is going on here.
It's controlled by the BUILD_INHOST env variable.
You may try setting the REBUILD_BUILDERS env variable to YES so that it rebuilds the docker images from scratch; maybe there are some leftovers.

I will check this on Monday if the issue persists.

@JanuszL
Contributor

JanuszL commented Aug 28, 2020

There is an additional step in build.sh that prepares the plugin builder image. Please relaunch with REBUILD_BUILDERS=YES BUILD_TF_PLUGIN=YES PREBUILD_TF_PLUGINS=NO CUDA_VERSION=10.0 ./build.sh and see if that helps.

@kindoblue
Contributor Author

First of all I pruned all the docker stuff on my system with the command:
docker system prune -a

Then I issued the command in the DALI/docker directory:
REBUILD_BUILDERS=YES BUILD_TF_PLUGIN=YES CUDA_VERSION=10.0 ./build.sh

Almost immediately the build script fails with an error similar to this one:
gliderlabs/docker-alpine#307

It is probably due to my system (Ubuntu 20.04), but anyway I replaced all the calls in build.sh of the form
docker build...
with
docker build --network host...

After compiling half the world, I now have the following files in the wheelhouse directory:

➜  wheelhouse git:(master) ✗ ls -ltr
total 261760
-rw-r--r-- 1 ice ice 267728670 aug 29 08:45 nvidia_dali_cuda100-0.26.0.dev0-12345-py3-none-manylinux2014_x86_64.whl
-rw-r--r-- 1 ice ice    306643 aug 29 09:41 nvidia-dali-tf-plugin-cuda100-0.26.0.dev0.tar.gz

I don't see any whl for the DALI TensorFlow plugin, just a tar.gz. Is it supposed to be like this? Note that the script ends with the following output:

Creating tar archive
removing 'nvidia-dali-tf-plugin-cuda100-0.26.0.dev0' (and everything under it)
++ cp dist/nvidia-dali-tf-plugin-cuda100-0.26.0.dev0.tar.gz /dali_tf_sdist
/opt/dali/dali_tf_plugin
++ popd
+ docker cp extract_dali_tf_sdist:/dali_tf_sdist/. dali_tf_sdist
+ cp dali_tf_sdist/nvidia-dali-tf-plugin-cuda100-0.26.0.dev0.tar.gz wheelhouse/
+ cp 'dali_tf_sdist/dummy/*.tar.gz' wheelhouse/dummy
cp: cannot stat 'dali_tf_sdist/dummy/*.tar.gz': No such file or directory
+ true
+ docker rm -f extract_dali_tf_sdist
extract_dali_tf_sdist
+ rm -rf dali_tf_plugin/whl
+ rm -rf dali_tf_sdist/
+ '[' NO == YES ']'
+ popd

@klecki
Contributor

klecki commented Aug 29, 2020

Yes, the TensorFlow plugin is distributed as a source distribution, hence the .tar.gz. If you keep PREBUILD_TF_PLUGINS unchanged (it's YES by default), it will contain not only the sources but also the prebuilt plugin libraries.

During installation it checks whether the prebuilt libraries are compatible with the TensorFlow distribution you are using and, if so, installs them. If they are not compatible (for example, your TensorFlow was built on your machine with a different compiler than expected), it queries TensorFlow for its configuration and attempts to build the plugin libraries during installation. If that fails, you will be notified of what didn't match in the configuration.
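
The decision described here can be pictured roughly as follows. This is only a sketch of the logic as explained in this comment, not the plugin's actual installer: the function names, the dictionary keys (`tf_version`, `compiler`) and the returned strings are all hypothetical.

```python
def config_mismatches(prebuilt, local):
    """Return {key: (prebuilt_value, local_value)} for every differing key."""
    keys = sorted(set(prebuilt) | set(local))
    return {
        key: (prebuilt.get(key), local.get(key))
        for key in keys
        if prebuilt.get(key) != local.get(key)
    }


def choose_plugin_action(prebuilt_configs, local):
    """Use a prebuilt library if one matches, otherwise build from source.

    Returns (action, reports), where reports lists the mismatches found for
    each prebuilt configuration that was rejected.
    """
    reports = []
    for prebuilt in prebuilt_configs:
        diff = config_mismatches(prebuilt, local)
        if not diff:
            return "use_prebuilt", []
        reports.append(diff)
    return "build_from_source", reports


# Hypothetical example: same TensorFlow version, different compiler.
prebuilt_configs = [{"tf_version": "1.15", "compiler": "gcc 5.4"}]
local = {"tf_version": "1.15", "compiler": "gcc 9.3"}
action, reports = choose_plugin_action(prebuilt_configs, local)
print(action, reports)
```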

@kindoblue
Contributor Author

I've managed to install the TF plugin with this command (setting CFLAGS because it wanted to compile the plugin):

CFLAGS="-I$CUDA_HOME/include $CFLAGS" pip install nvidia-dali-tf-plugin-cuda100-0.26.0.dev0.tar.gz
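
The same idea, prepending the CUDA include directory to CFLAGS before invoking pip, can be sketched in Python for use in an install script. The helper name is hypothetical, and the commented-out call only shows how the resulting environment might be passed to pip.

```python
def env_with_cuda_include(environ, cuda_home):
    """Return a copy of environ with -I<cuda_home>/include prepended to CFLAGS."""
    env = dict(environ)
    include_flag = "-I" + cuda_home.rstrip("/") + "/include"
    existing = env.get("CFLAGS", "")
    env["CFLAGS"] = (include_flag + " " + existing).strip()
    return env


# Possible usage (not executed here):
# import os, subprocess, sys
# env = env_with_cuda_include(os.environ, os.environ["CUDA_HOME"])
# subprocess.run(
#     [sys.executable, "-m", "pip", "install",
#      "nvidia-dali-tf-plugin-cuda100-0.26.0.dev0.tar.gz"],
#     env=env, check=True,
# )
```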

Well, it was not smooth, but I finally managed to get DALI and the TF plugin compiled. Thanks for the help.

@JanuszL
Contributor

JanuszL commented Aug 31, 2020

Hi,
I'm glad it works.
I don't think you need to run docker system prune -a; REBUILD_BUILDERS=YES should rebuild the images if needed, or use a cached version on your system if there is no change in the code.

@JanuszL
Contributor

JanuszL commented Oct 5, 2020

DALI 0.26 is available and should include the needed functionality.

@JanuszL JanuszL closed this as completed Oct 5, 2020