NCCL tests don't work on WSL #442

Open
PolKul opened this issue Dec 18, 2020 · 18 comments

@PolKul

PolKul commented Dec 18, 2020

I've installed NCCL and its tests on WSL. When trying to run a test like this:

NCCL_ALGO=Ring NCCL_PROTO=Simple NCCL_DEBUG_FILE=debug.%h.%p NCCL_DEBUG=INFO ./build/all_reduce_perf -b 128M -e 128M -g 1 -n 1 -w 0 -c 0 -m 0

I get the following error message:

nThread 1 nGpus 1 minBytes 134217728 maxBytes 134217728 step: 1048576(bytes) warmup iters: 0 iters: 1 validation: 0

Using devices
Rank 0 Pid 36629 on DESKTOP device 0 [0x21] TITAN RTX
NCCL version 2.8.3+cuda11.1
DESKTOP: Test NCCL failure common.cu:777 'unhandled system error'

The debug log shows this:

DESKTOP:36629:36629 [0] NCCL INFO Bootstrap : Using eth0:192.168.143.185<0>
DESKTOP:36629:36629 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

DESKTOP:36629:36629 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
DESKTOP:36629:36629 [0] NCCL INFO NET/Socket : Using [0]eth0:192.168.143.185<0>
DESKTOP:36629:36629 [0] NCCL INFO Using network Socket
DESKTOP:36629:36629 [0] NCCL INFO NCCL version 2.8.3+cuda11.1

DESKTOP:36629:36635 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:21/../../0000:21:00.0
DESKTOP:36629:36635 [0] NCCL INFO graph/xml.cc:469 -> 2
DESKTOP:36629:36635 [0] NCCL INFO graph/xml.cc:660 -> 2
DESKTOP:36629:36635 [0] NCCL INFO graph/topo.cc:522 -> 2
DESKTOP:36629:36635 [0] NCCL INFO init.cc:627 -> 2
DESKTOP:36629:36635 [0] NCCL INFO init.cc:878 -> 2
DESKTOP:36629:36635 [0] NCCL INFO group.cc:72 -> 2 [Async thread]
DESKTOP:36629:36629 [0] NCCL INFO init.cc:946 -> 2

NCCL version: 2.8.3
CUDA version: 11.1
Windows: 10.0.20277
WSL: Ubuntu 20.04

@Dango233

I have exactly the same problem...

@AddyLaddy
Collaborator

Thanks for these reports. Currently NCCL is not supported on WSL2 installations but we are working on validating it.

@PolKul
Author

PolKul commented Jan 16, 2021

I think this is also the reason I cannot use multi-GPU training with PyTorch: when I use PyTorch DataParallel, it gives me a similar NCCL error.
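
For reference, a quick way to see whether PyTorch's bundled NCCL hits the same failure, independent of DataParallel, is a small script like the one below. This is only a sketch: it assumes a CUDA-enabled PyTorch build and at least two visible GPUs, and it uses the low-level torch.cuda.nccl module that DataParallel's broadcast/reduce helpers typically go through.

NCCL_DEBUG=INFO python - <<'EOF'
import torch

# NCCL version this PyTorch build was compiled against
print("bundled NCCL:", torch.cuda.nccl.version())

if torch.cuda.device_count() >= 2:
    # one small tensor per GPU; all_reduce sums them in place across devices
    tensors = [torch.ones(1, device=f"cuda:{i}") for i in range(2)]
    torch.cuda.nccl.all_reduce(tensors)
    print([t.item() for t in tensors])  # expect [2.0, 2.0] if NCCL works
EOF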

@amannm

amannm commented Jan 21, 2021

> Thanks for these reports. Currently NCCL is not supported on WSL2 installations but we are working on validating it.

I also ran into the issue of NCCL simply not supporting WSL environments. It would have helped to have the lack of support documented right here https://docs.nvidia.com/cuda/wsl-user-guide/index.html#known-limitations

This might be the only place on the net a dev has said anything on the topic.

@monotaro3

I may have the same error. I'm trying to use multiple GPUs across two nodes, one of which is a WSL2 environment, but the NCCL communicator seems to hang, showing "cupy.cuda.nccl.NcclError: NCCL_ERROR_SYSTEM_ERROR: unhandled system error" only on the WSL2 side. Looking forward to the fix.

@jogiji

jogiji commented Jul 10, 2021

Any update on this issue? NCCL support for WSL2 is needed so that I can use Transfer Learning Toolkit 3 on my Windows desktop under WSL2.

@AddyLaddy
Collaborator

NCCL 2.10.3 was released last week and it should support WSL2 with a single GPU. Multi-GPU has not been validated yet.
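
For anyone re-testing after upgrading, a minimal single-GPU sanity check with nccl-tests might look like this (a sketch; -g 1 limits the run to one GPU, which is the configuration reported as supported on WSL2 so far):

NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8M -e 128M -f 2 -g 1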

@jogiji

jogiji commented Sep 3, 2021

It still doesn't work with the latest TAO upgrades on WSL2 and the newest driver, 510.06. The output follows.
FYI, I am trying to run the latest TAO Toolkit from NGC in Docker on WSL2.
I have an RTX 3090 GPU.

Epoch 1/80
  1/238 [..............................] - ETA: 35:21 - loss: 3.4665 - acc: 0.0938WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:146: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

2021-09-03 03:34:33,108 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:146: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

  2/238 [..............................] - ETA: 19:56 - loss: 3.6723 - acc: 0.0625/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (0.548347). Check your callbacks.
  % delta_t_median)
238/238 [==============================] - 40s 169ms/step - loss: 2.1327 - acc: 0.4371 - val_loss: 1.5816 - val_acc: 0.5542
96d216ed9f8a:127:179 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0> [1]eth0:172.18.0.2<0>
96d216ed9f8a:127:179 [0] NCCL INFO NET/Plugin : Plugin load returned 0 : libnccl-net.so: cannot open shared object file: No such file or directory.
96d216ed9f8a:127:179 [0] NCCL INFO NET/IB : No device found.
96d216ed9f8a:127:179 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.18.0.2<0>
96d216ed9f8a:127:179 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1

96d216ed9f8a:127:179 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:0a/../../0000:0a:00.0
96d216ed9f8a:127:179 [0] NCCL INFO graph/xml.cc:469 -> 2
96d216ed9f8a:127:179 [0] NCCL INFO graph/xml.cc:660 -> 2
96d216ed9f8a:127:179 [0] NCCL INFO graph/topo.cc:523 -> 2
96d216ed9f8a:127:179 [0] NCCL INFO init.cc:581 -> 2
96d216ed9f8a:127:179 [0] NCCL INFO init.cc:840 -> 2
96d216ed9f8a:127:179 [0] NCCL INFO init.cc:876 -> 2
96d216ed9f8a:127:179 [0] NCCL INFO init.cc:887 -> 2
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: ncclCommInitRank failed: unhandled system error
	 [[{{node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0}}]]
  (1) Unknown: ncclCommInitRank failed: unhandled system error
	 [[{{node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0}}]]
	 [[MetricAverageCallback/truediv/_5113]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 500, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 494, in return_func
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 482, in return_func
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 495, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 468, in run_experiment
  File "/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py", line 251, in fit_generator
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/usr/local/lib/python3.6/dist-packages/keras/callbacks.py", line 79, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py", line 84, in on_epoch_end
    self._average_metrics_in_place(logs)
  File "/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py", line 77, in _average_metrics_in_place
    self.backend.get_session().run(self.allreduce_ops[metric])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: ncclCommInitRank failed: unhandled system error
	 [[node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
  (1) Unknown: ncclCommInitRank failed: unhandled system error
	 [[node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
	 [[MetricAverageCallback/truediv/_5113]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0':
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 500, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 482, in return_func
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 495, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 468, in run_experiment
  File "/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py", line 251, in fit_generator
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/usr/local/lib/python3.6/dist-packages/keras/callbacks.py", line 79, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py", line 84, in on_epoch_end
    self._average_metrics_in_place(logs)
  File "/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py", line 73, in _average_metrics_in_place
    self._make_variable(metric, value)
  File "/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py", line 58, in _make_variable
    allreduce_op = hvd.allreduce(var, device_dense=self.device)
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 80, in allreduce
    summed_tensor_compressed = _allreduce(tensor_compressed)
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_ops.py", line 86, in _allreduce
    return MPI_LIB.horovod_allreduce(tensor, name=name)
  File "<string>", line 80, in horovod_allreduce
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

2021-09-03 09:05:06,420 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

@sjeaugey
Member

sjeaugey commented Sep 3, 2021

From your log:

NCCL version 2.7.8+cuda11.1

Note: NCCL might have been compiled statically into TensorFlow, so upgrading the system NCCL might not be enough to get the newest version.
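
One rough way to tell whether a given TensorFlow build links NCCL dynamically (and so would pick up a system-wide upgrade at all) is to inspect its main native module with ldd. The path below is only an assumption based on the TF 1.15-era container in the log above; adjust it for your install:

ldd /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so | grep -i nccl \
  || echo "no dynamic libnccl found -> NCCL is most likely linked in statically"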

@tanzhenyu

The current status seems to be that NCCL isn't supported on multiple GPUs under WSL.

@softmatic

Same issue here with WSL2 (Windows 11), driver 510.06 and torch 1.9.1.cu111 with 2x 2080 Super.

@AddyLaddy
Collaborator

NCCL 2.11.4 has been tested on multi-GPU Win11 systems. I don't know what drivers and OS level are required though. You need to make sure that your pytorch/tensorflow subsystem hasn't been statically linked against an older NCCL version.

@softmatic

@AddyLaddy Thanks for getting back to me. I checked and Torch 1.9.1.cu111 apparently uses NCCL 2.7.8. Will have to see what our options are now.
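
For reference, one way such a check can be done (a sketch; the library path is typical for recent pip wheels and may differ for other installs):

python -c "import torch; print(torch.cuda.nccl.version())"
ldd $(python -c "import torch, os; print(os.path.join(os.path.dirname(torch.__file__), 'lib', 'libtorch_cuda.so'))") | grep -i nccl \
  || echo "no dynamic libnccl -> the bundled NCCL is statically linked"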

@cascgu

cascgu commented Oct 28, 2021

> NCCL 2.11.4 has been tested on multi-GPU Win11 systems. I don't know what drivers and OS level are required though. You need to make sure that your pytorch/tensorflow subsystem hasn't been statically linked against an older NCCL version.

@AddyLaddy How can I unlink the old NCCL from PyTorch and get PyTorch to use NCCL 2.11.4? I have installed 2.11.4 in WSL2 and it passes nccl-tests. However, when training a model, PyTorch 1.7.1 still calls NCCL 2.7.8.

@AddyLaddy
Collaborator

I'm not a PyTorch expert, but I believe you need to configure and rebuild it using the USE_SYSTEM_NCCL=1 option. Perhaps ask in a PyTorch forum for help?
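
For anyone attempting that route, a rough outline of a source build against the system NCCL could look like the following. This is a sketch only; the exact steps, branches, and build prerequisites vary by PyTorch version, so check the PyTorch build documentation first:

git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
# USE_SYSTEM_NCCL=1 makes the build link against the NCCL installed in WSL2
# instead of the copy bundled with the PyTorch sources
USE_SYSTEM_NCCL=1 python setup.py install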

@cascgu

cascgu commented Oct 31, 2021

@AddyLaddy Thank you very much. I'll try to recompile PyTorch.

@Chan0081

> @AddyLaddy Thank you very much. I'll try to recompile PyTorch.

Hi, I've run into the same issue recently. Did recompiling PyTorch work?
