Compiling the TF plugin (using docker) fails #2213

kindoblue · 2020-08-15T12:01:11Z

I'm trying to compile DALI with the following command, run from the docker directory:

BUILD_TF_PLUGIN=YES PYVER=3.7 CUDA_VERSION=10.0 ./build.sh

DALI compiled and the wheel was generated, but then the script starts to build the TF plugin and I get the following error:

+ nvidia-docker run --name extract_dali_tf_prebuilt_manylinux1 nvidia/dali:cu100.build_tf_manylinux1 /bin/bash -c 'source /opt/dali/dali_tf_plugin/build_in_custom_op_docker.sh'
./build.sh: line 261: nvidia-docker: command not found

I don't recall reading in the documentation about installing nvidia-docker. Is it really needed for building a plugin within a docker image?

I'm on Ubuntu 20.04, using the docker build script.


kindoblue · 2020-08-15T12:54:38Z

I've installed nvidia-docker and the build process gets further, but it still ends with an error:

+ nvidia-docker run --name extract_dali_tf_prebuilt_manylinux1 nvidia/dali:cu100.build_tf_manylinux1 /bin/bash -c 'source /opt/dali/dali_tf_plugin/build_in_custom_op_docker.sh'
++ set -e
++ PYTHON_DIST_PACKAGES=($(python -c "import site; print(site.getsitepackages()[0])"))
+++ python -c 'import site; print(site.getsitepackages()[0])'
++ DALI_TOPDIR=/usr/local/lib/python2.7/dist-packages/nvidia/dali
+++ cat /usr/local/cuda/version.txt
+++ head -1
+++ sed 's/.*Version \([0-9]\+\)\.\([0-9]\+\).*/\1\2/'
++ CUDA_VERSION=100
+++ python ../qa/setup_packages.py -n -u tensorflow-gpu --cuda 100
Traceback (most recent call last):
  File "../qa/setup_packages.py", line 6, in <module>
    import urllib.parse
ImportError: No module named parse
++ LAST_CONFIG_INDEX=

klecki · 2020-08-17T11:01:12Z

Hi @kindoblue,
sorry for the confusion about nvidia-docker. I will update the documentation and the script to support the new syntax from the NVIDIA Container Toolkit (docker run --gpus all).
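
A minimal sketch of how a script could cope with both setups, assuming the only difference is the launcher: use the nvidia-docker wrapper when it is on PATH, otherwise fall back to docker run --gpus all. The function name gpu_docker_run is hypothetical, and this is not the change that was made to build.sh.

```shell
# Hypothetical helper, not part of build.sh: print the launcher to use
# for GPU-enabled containers on this host.
gpu_docker_run() {
    if command -v nvidia-docker >/dev/null 2>&1; then
        # The legacy nvidia-docker wrapper is installed.
        printf '%s\n' "nvidia-docker run"
    else
        # NVIDIA Container Toolkit syntax (Docker 19.03 or newer).
        printf '%s\n' "docker run --gpus all"
    fi
}

# Example use (left commented out so that nothing is launched here):
# $(gpu_docker_run) --name my_container some/image /bin/bash -c 'nvidia-smi'
```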

As for the second error, it looks like the script does not propagate the Python version properly and uses Python 2.7 for the TF plugin containers. I will try to post a fix soon and will get back to you when I have a PR.

Thanks for reporting that.

klecki · 2020-08-17T15:27:27Z

I adjusted the scripts and docs a bit in #2214.

On my machine it successfully built both the wheel and TF plugin.

It's worth mentioning that we recently started building a single wheel that is compatible with several minor Python versions.
It should be enough to run the script as follows, omitting PYVER:

BUILD_TF_PLUGIN=YES CUDA_VERSION=10.0 ./build.sh

klecki · 2020-08-24T11:07:35Z

Hi, the PR has been merged. Can you check whether it helps, so we can close the issue?

kindoblue · 2020-08-24T11:32:34Z

I'm on vacation now. I'll be able to test the fix next week. Thanks.

kindoblue · 2020-08-28T17:47:23Z

I tried again. The build process fails trying to build the plugin, with the error:

+ export DALI_TF_BUILDER_CONTAINER_MANYLINUX2010=extract_dali_tf_prebuilt_manylinux2010
+ DALI_TF_BUILDER_CONTAINER_MANYLINUX2010=extract_dali_tf_prebuilt_manylinux2010
+ docker run --gpus all --name extract_dali_tf_prebuilt_manylinux2010 nvidia/dali:cu100.build_tf_manylinux2010 /bin/bash -c 'source /opt/dali/dali_tf_plugin/build_in_custom_op_docker.sh'
++ set -e
++ PYTHON_DIST_PACKAGES=($(python -c "import site; print(site.getsitepackages()[0])"))
+++ python -c 'import site; print(site.getsitepackages()[0])'
++ DALI_TOPDIR=/usr/local/lib/python3.6/dist-packages/nvidia/dali
+++ cat /usr/local/cuda/version.txt
+++ head -1
+++ sed 's/.*Version \([0-9]\+\)\.\([0-9]\+\).*/\1\2/'
++ CUDA_VERSION=100
+++ python ../qa/setup_packages.py -n -u tensorflow-gpu --cuda 100
Traceback (most recent call last):
  File "../qa/setup_packages.py", line 409, in <module>
    main()
  File "../qa/setup_packages.py", line 400, in main
    print (cal_num_of_configs(args.use, args.cuda) - 1)
  File "../qa/setup_packages.py", line 365, in cal_num_of_configs
    ret *= pckg.get_num_of_version(cuda_version)
  File "../qa/setup_packages.py", line 140, in get_num_of_version
    return len(self.get_all_versions(cuda_version))
  File "../qa/setup_packages.py", line 218, in get_all_versions
    return self.filter_versions(self.versions[cuda_version])
  File "../qa/setup_packages.py", line 106, in filter_versions
    return [str(v) for v in versions if v]
  File "../qa/setup_packages.py", line 106, in <listcomp>
    return [str(v) for v in versions if v]
  File "../qa/setup_packages.py", line 46, in __bool__
    (not self.python_max_ver or parse(PYTHON_VERSION) <= parse(self.python_max_ver))
TypeError: 'module' object is not callable
++ LAST_CONFIG_INDEX=

I tried to edit the file ../qa/setup_packages.py, but apparently my changes are not picked up (and the line numbers don't match), so perhaps the file is taken from a docker image?

PS: I used the original command line

BUILD_TF_PLUGIN=YES PYVER=3.7 CUDA_VERSION=10.0 ./build.sh

because I only now realized that PYVER=3.7 can be omitted.

======================================
PPS: I even tried the following command to work around the problem

BUILD_TF_PLUGIN=YES PREBUILD_TF_PLUGINS=NO CUDA_VERSION=10.0 ./build.sh

but then I get another error:

Writing nvidia-dali-tf-plugin-cuda100-0.26.0.dev0/setup.cfg
creating dist
Creating tar archive
removing 'nvidia-dali-tf-plugin-cuda100-0.26.0.dev0' (and everything under it)
++ cp dist/nvidia-dali-tf-plugin-cuda100-0.26.0.dev0.tar.gz /dali_tf_sdist
/opt/dali/dali_tf_plugin
++ popd
+ docker cp extract_dali_tf_sdist:/dali_tf_sdist/. dali_tf_sdist
+ cp dali_tf_sdist/nvidia-dali-tf-plugin-cuda100-0.26.0.dev0.tar.gz wheelhouse/
+ cp 'dali_tf_sdist/dummy/*.tar.gz' wheelhouse/dummy
cp: cannot stat 'dali_tf_sdist/dummy/*.tar.gz': No such file or directory
+ true
+ docker rm -f extract_dali_tf_sdist
extract_dali_tf_sdist
+ rm -rf dali_tf_plugin/whl
+ rm -rf dali_tf_sdist/
+ '[' NO == YES ']'
+ popd

klecki · 2020-08-28T19:49:09Z

Hmm, the source should be mounted into docker; I'm not sure what is going on here.
It's controlled by the BUILD_INHOST env variable.
You could try setting the REBUILD_BUILDERS env variable to YES so that the docker images are rebuilt from scratch; maybe there are some leftovers.

I will check this on Monday if the issue still persists.

JanuszL · 2020-08-28T20:25:14Z

There is an additional step in build.sh that prepares the plugin builder image. Please relaunch with REBUILD_BUILDERS=YES BUILD_TF_PLUGIN=YES PREBUILD_TF_PLUGINS=NO CUDA_VERSION=10.0 ./build.sh and see if that helps.

kindoblue · 2020-08-29T07:56:21Z

First of all I pruned all the docker stuff on my system with the command:
docker system prune -a

Then I issued the command in the DALI/docker directory:
REBUILD_BUILDERS=YES BUILD_TF_PLUGIN=YES CUDA_VERSION=10.0 ./build.sh

Almost immediately the build script fails with an error similar to this one:
gliderlabs/docker-alpine#307

It is probably due to my system (Ubuntu 20.04), but in any case I replaced all the calls in build.sh of the form
docker build...
with
docker build --network host...
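
The same substitution can be scripted rather than edited by hand. A minimal sketch, assuming every call in build.sh is spelled exactly as docker build followed by a space; add_host_network is a hypothetical name, and the filter is not idempotent, so it should be applied once to a pristine copy.

```shell
# Hypothetical filter: read a script on stdin and add "--network host"
# to every "docker build " invocation, writing the result to stdout.
add_host_network() {
    sed 's/docker build /docker build --network host /g'
}

# Example use (writes a patched copy and leaves build.sh itself untouched):
# add_host_network < build.sh > build_hostnet.sh
```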

After compiling half the world, I now have the following files in the wheelhouse directory:

➜  wheelhouse git:(master) ✗ ls -ltr
total 261760
-rw-r--r-- 1 ice ice 267728670 aug 29 08:45 nvidia_dali_cuda100-0.26.0.dev0-12345-py3-none-manylinux2014_x86_64.whl
-rw-r--r-- 1 ice ice    306643 aug 29 09:41 nvidia-dali-tf-plugin-cuda100-0.26.0.dev0.tar.gz

I don't see any whl for the DALI TensorFlow plugin, just a tar.gz. Is it supposed to be like this? Note that the script ends with the following output:

Creating tar archive
removing 'nvidia-dali-tf-plugin-cuda100-0.26.0.dev0' (and everything under it)
++ cp dist/nvidia-dali-tf-plugin-cuda100-0.26.0.dev0.tar.gz /dali_tf_sdist
/opt/dali/dali_tf_plugin
++ popd
+ docker cp extract_dali_tf_sdist:/dali_tf_sdist/. dali_tf_sdist
+ cp dali_tf_sdist/nvidia-dali-tf-plugin-cuda100-0.26.0.dev0.tar.gz wheelhouse/
+ cp 'dali_tf_sdist/dummy/*.tar.gz' wheelhouse/dummy
cp: cannot stat 'dali_tf_sdist/dummy/*.tar.gz': No such file or directory
+ true
+ docker rm -f extract_dali_tf_sdist
extract_dali_tf_sdist
+ rm -rf dali_tf_plugin/whl
+ rm -rf dali_tf_sdist/
+ '[' NO == YES ']'
+ popd

klecki · 2020-08-29T10:42:40Z

Yes, the TensorFlow plugin is distributed as a source distribution, hence the .tar.gz. If you keep PREBUILD_TF_PLUGINS unchanged (it's YES by default), it will contain not only the sources but also the prebuilt plugin libraries.

During installation it will check whether the prebuilt libraries are compatible with the TensorFlow distribution you are using and install them. If they are not compatible (for example, you have a TensorFlow built on your machine with a different compiler than expected), it will attempt to query TensorFlow for its configuration and build the plugin libraries during installation. If that fails, you will be notified of what didn't match in the configuration.
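
To illustrate what querying TensorFlow for its configuration can look like, here is a minimal sketch. It is an assumption, not the plugin installer's actual logic: this thread does not show which settings are compared. The sketch picks out one setting that commonly differs between builds, the C++ ABI define in the compile flags that tf.sysconfig.get_compile_flags() reports. The helper name abi_from_flags is hypothetical.

```shell
# Hypothetical helper: scan a list of compiler flags and print the value
# of -D_GLIBCXX_USE_CXX11_ABI, or return non-zero if the flag is absent.
abi_from_flags() {
    for flag in "$@"; do
        case "$flag" in
            -D_GLIBCXX_USE_CXX11_ABI=*)
                # Strip everything up to and including the first "=".
                printf '%s\n' "${flag#*=}"
                return 0
                ;;
        esac
    done
    return 1
}

# Example use with an installed TensorFlow (commented out, needs TF):
# abi_from_flags $(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_compile_flags()))')
```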

kindoblue · 2020-08-29T12:46:47Z

I managed to install the TF plugin with this command (setting CFLAGS because the installer wanted to compile the plugin):

CFLAGS="-I$CUDA_HOME/include $CFLAGS" pip install nvidia-dali-tf-plugin-cuda100-0.26.0.dev0.tar.gz
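
The command above relies on CUDA_HOME being set. A minimal sketch of a more defensive variant, assuming /usr/local/cuda as the fallback location (an assumption about the host; adjust it to your installation). The helper name cuda_cflags is hypothetical.

```shell
# Hypothetical helper: build a CFLAGS value that puts the CUDA headers
# on the include path and keeps any flags that are already set.
cuda_cflags() {
    # Fall back to /usr/local/cuda (an assumption) when CUDA_HOME is unset.
    cuda_home="${CUDA_HOME:-/usr/local/cuda}"
    # Append the existing CFLAGS only if it is non-empty.
    printf '%s\n' "-I${cuda_home}/include${CFLAGS:+ $CFLAGS}"
}

# Example use (commented out, since it installs a package):
# CFLAGS="$(cuda_cflags)" pip install nvidia-dali-tf-plugin-cuda100-0.26.0.dev0.tar.gz
```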

Well, it was not smooth, but I finally managed to get DALI and the TF plugin compiled. Thanks for the help.

JanuszL · 2020-08-31T08:43:39Z

Hi,
I'm glad it works.
I don't think you need to run docker system prune -a. REBUILD_BUILDERS=YES should rebuild the images if needed, or use a cached version on your system if the code has not changed.

JanuszL · 2020-10-05T17:08:33Z

DALI 0.26 is available and should include the needed functionality.

klecki added the bug label Aug 17, 2020

klecki mentioned this issue Aug 17, 2020

Fix docker/build.sh to use Python 3 for TF plugin #2214

Merged

klecki added this to the Release_0.26.0 milestone Aug 19, 2020

JanuszL closed this as completed Oct 5, 2020
