NCCL tests don't work on WSL #442

Open
PolKul opened this issue Dec 18, 2020 · 18 comments

@PolKul

PolKul commented Dec 18, 2020

I've installed NCCL and its tests on WSL. When trying to run a test like this:

NCCL_ALGO=Ring NCCL_PROTO=Simple NCCL_DEBUG_FILE=debug.%h.%p NCCL_DEBUG=INFO ./build/all_reduce_perf -b 128M -e 128M -g 1 -n 1 -w 0 -c 0 -m 0

I get the following error message:

nThread 1 nGpus 1 minBytes 134217728 maxBytes 134217728 step: 1048576(bytes) warmup iters: 0 iters: 1 validation: 0

Using devices
Rank 0 Pid 36629 on DESKTOP device 0 [0x21] TITAN RTX
NCCL version 2.8.3+cuda11.1
DESKTOP: Test NCCL failure common.cu:777 'unhandled system error'

The debug log shows this:

DESKTOP:36629:36629 [0] NCCL INFO Bootstrap : Using eth0:192.168.143.185<0>
DESKTOP:36629:36629 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

DESKTOP:36629:36629 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
DESKTOP:36629:36629 [0] NCCL INFO NET/Socket : Using [0]eth0:192.168.143.185<0>
DESKTOP:36629:36629 [0] NCCL INFO Using network Socket
DESKTOP:36629:36629 [0] NCCL INFO NCCL version 2.8.3+cuda11.1

DESKTOP:36629:36635 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:21/../../0000:21:00.0
DESKTOP:36629:36635 [0] NCCL INFO graph/xml.cc:469 -> 2
DESKTOP:36629:36635 [0] NCCL INFO graph/xml.cc:660 -> 2
DESKTOP:36629:36635 [0] NCCL INFO graph/topo.cc:522 -> 2
DESKTOP:36629:36635 [0] NCCL INFO init.cc:627 -> 2
DESKTOP:36629:36635 [0] NCCL INFO init.cc:878 -> 2
DESKTOP:36629:36635 [0] NCCL INFO group.cc:72 -> 2 [Async thread]
DESKTOP:36629:36629 [0] NCCL INFO init.cc:946 -> 2

NCCL version: 2.8.3
CUDA version: 11.1
Windows: 10.0.20277
WSL: Ubuntu 20.04

@Dango233

I have exactly the same problem...

@AddyLaddy
Collaborator

Thanks for these reports. Currently NCCL is not supported on WSL2 installations but we are working on validating it.

@PolKul
Author

PolKul commented Jan 16, 2021

I think this is also the reason I cannot use multi-GPU training with PyTorch: when I use PyTorch DataParallel, it gives me a similar NCCL error.
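
For reference, a quick way to see whether PyTorch's bundled NCCL hits the same failure, independent of DataParallel, is a small script like the one below. This is only a sketch: it assumes a CUDA-enabled PyTorch build and at least two visible GPUs, and it uses the low-level torch.cuda.nccl module that DataParallel's broadcast/reduce helpers typically go through.

NCCL_DEBUG=INFO python - <<'EOF'
import torch

# NCCL version this PyTorch build was compiled against
print("bundled NCCL:", torch.cuda.nccl.version())

if torch.cuda.device_count() >= 2:
    # one small tensor per GPU; all_reduce sums them in place across devices
    tensors = [torch.ones(1, device=f"cuda:{i}") for i in range(2)]
    torch.cuda.nccl.all_reduce(tensors)
    print([t.item() for t in tensors])  # expect [2.0, 2.0] if NCCL works
EOF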

@amannm

amannm commented Jan 21, 2021

> Thanks for these reports. Currently NCCL is not supported on WSL2 installations but we are working on validating it.

I also ran into the issue of NCCL simply not supporting WSL environments. It would have helped to have the lack of support documented right here https://docs.nvidia.com/cuda/wsl-user-guide/index.html#known-limitations

This might be the only place on the net a dev has said anything on the topic.

@monotaro3

I may have the same error. I'm trying to use multiple GPUs across two nodes, one of which is a WSL2 environment, but the NCCL communicator seems to hang, showing "cupy.cuda.nccl.NcclError: NCCL_ERROR_SYSTEM_ERROR: unhandled system error" only on the WSL2 side. Looking forward to the fix.

@jogiji

jogiji commented Jul 10, 2021

Any update on this issue? NCCL support for WSL2 is needed so that I can use Transfer Learning Toolkit 3 on my Windows desktop under WSL2.

@AddyLaddy
Collaborator

NCCL 2.10.3 was released last week and it should support WSL2 with a single GPU. Multi-GPU has not been validated yet.
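
For anyone re-testing after upgrading, a minimal single-GPU sanity check with nccl-tests might look like this (a sketch; -g 1 limits the run to one GPU, which is the configuration reported as supported on WSL2 so far):

NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8M -e 128M -f 2 -g 1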

@jogiji

jogiji commented Sep 3, 2021

It still doesn't work with the latest TAO upgrades on WSL2 and the newest driver, 510.06. The output follows.
FYI, I am trying to run the latest TAO Toolkit from NGC in Docker on WSL2.
I have an RTX 3090 GPU.

Epoch 1/80
  1/238 [..............................] - ETA: 35:21 - loss: 3.4665 - acc: 0.0938WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:146: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

2021-09-03 03:34:33,108 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:146: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

  2/238 [..............................] - ETA: 19:56 - loss: 3.6723 - acc: 0.0625/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (0.548347). Check your callbacks.
  % delta_t_median)
238/238 [==============================] - 40s 169ms/step - loss: 2.1327 - acc: 0.4371 - val_loss: 1.5816 - val_acc: 0.5542
96d216ed9f8a:127:179 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0> [1]eth0:172.18.0.2<0>
96d216ed9f8a:127:179 [0] NCCL INFO NET/Plugin : Plugin load returned 0 : libnccl-net.so: cannot open shared object file: No such file or directory.
96d216ed9f8a:127:179 [0] NCCL INFO NET/IB : No device found.
96d216ed9f8a:127:179 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.18.0.2<0>
96d216ed9f8a:127:179 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1

96d216ed9f8a:127:179 [0] graph/xml.cc:332 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:0a/../../0000:0a:00.0
96d216ed9f8a:127:179 [0] NCCL INFO graph/xml.cc:469 -> 2
96d216ed9f8a:127:179 [0] NCCL INFO graph/xml.cc:660 -> 2
96d216ed9f8a:127:179 [0] NCCL INFO graph/topo.cc:523 -> 2
96d216ed9f8a:127:179 [0] NCCL INFO init.cc:581 -> 2
96d216ed9f8a:127:179 [0] NCCL INFO init.cc:840 -> 2
96d216ed9f8a:127:179 [0] NCCL INFO init.cc:876 -> 2
96d216ed9f8a:127:179 [0] NCCL INFO init.cc:887 -> 2
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: ncclCommInitRank failed: unhandled system error
	 [[{{node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0}}]]
  (1) Unknown: ncclCommInitRank failed: unhandled system error
	 [[{{node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0}}]]
	 [[MetricAverageCallback/truediv/_5113]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 500, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 494, in return_func
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 482, in return_func
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 495, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 468, in run_experiment
  File "/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py", line 251, in fit_generator
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/usr/local/lib/python3.6/dist-packages/keras/callbacks.py", line 79, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py", line 84, in on_epoch_end
    self._average_metrics_in_place(logs)
  File "/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py", line 77, in _average_metrics_in_place
    self.backend.get_session().run(self.allreduce_ops[metric])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: ncclCommInitRank failed: unhandled system error
	 [[node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
  (1) Unknown: ncclCommInitRank failed: unhandled system error
	 [[node MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
	 [[MetricAverageCallback/truediv/_5113]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'MetricAverageCallback/HorovodAllreduce_MetricAverageCallback_acc_0':
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 500, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 482, in return_func
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 495, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/makenet/scripts/train.py", line 468, in run_experiment
  File "/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py", line 251, in fit_generator
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/usr/local/lib/python3.6/dist-packages/keras/callbacks.py", line 79, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py", line 84, in on_epoch_end
    self._average_metrics_in_place(logs)
  File "/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py", line 73, in _average_metrics_in_place
    self._make_variable(metric, value)
  File "/usr/local/lib/python3.6/dist-packages/horovod/_keras/callbacks.py", line 58, in _make_variable
    allreduce_op = hvd.allreduce(var, device_dense=self.device)
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 80, in allreduce
    summed_tensor_compressed = _allreduce(tensor_compressed)
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_ops.py", line 86, in _allreduce
    return MPI_LIB.horovod_allreduce(tensor, name=name)
  File "<string>", line 80, in horovod_allreduce
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

2021-09-03 09:05:06,420 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

@sjeaugey
Member

sjeaugey commented Sep 3, 2021

From your log:

NCCL version 2.7.8+cuda11.1

Note: NCCL might have been compiled statically into TensorFlow, so upgrading the system NCCL might not be enough to get the newest version.
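
One rough way to tell whether a given TensorFlow build links NCCL dynamically (and so would pick up a system-wide upgrade at all) is to inspect its main native module with ldd. The path below is only an assumption based on the TF 1.15-era container in the log above; adjust it for your install:

ldd /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/_pywrap_tensorflow_internal.so | grep -i nccl \
  || echo "no dynamic libnccl found -> NCCL is most likely linked in statically"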

@tanzhenyu

The current status seems to be that NCCL isn't supported on multiple GPUs under WSL.

@softmatic

Same issue here with WSL2 (Windows 11), driver 510.06 and torch 1.9.1.cu111 with 2x 2080 Super.

@AddyLaddy
Collaborator

NCCL 2.11.4 has been tested on multi-GPU Win11 systems. I don't know what drivers and OS level are required though. You need to make sure that your pytorch/tensorflow subsystem hasn't been statically linked against an older NCCL version.

@softmatic

@AddyLaddy Thanks for getting back to me. I checked and Torch 1.9.1.cu111 apparently uses NCCL 2.7.8. Will have to see what our options are now.
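
For reference, one way such a check can be done (a sketch; the library path is typical for recent pip wheels and may differ for other installs):

python -c "import torch; print(torch.cuda.nccl.version())"
ldd $(python -c "import torch, os; print(os.path.join(os.path.dirname(torch.__file__), 'lib', 'libtorch_cuda.so'))") | grep -i nccl \
  || echo "no dynamic libnccl -> the bundled NCCL is statically linked"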

@cascgu

cascgu commented Oct 28, 2021

> NCCL 2.11.4 has been tested on multi-GPU Win11 systems. I don't know what drivers and OS level are required though. You need to make sure that your pytorch/tensorflow subsystem hasn't been statically linked against an older NCCL version.

@AddyLaddy How can I unlink the old NCCL from PyTorch and get PyTorch to use NCCL 2.11.4? I have installed 2.11.4 in WSL2 and it passes nccl-tests. However, when training a model, PyTorch 1.7.1 still calls NCCL 2.7.8.

@AddyLaddy
Collaborator

I'm not a PyTorch expert, but I believe you need to configure and rebuild it using the USE_SYSTEM_NCCL=1 option. Perhaps ask in a PyTorch forum for help?
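
For anyone attempting that route, a rough outline of a source build against the system NCCL could look like the following. This is a sketch only; the exact steps, branches, and build prerequisites vary by PyTorch version, so check the PyTorch build documentation first:

git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
# USE_SYSTEM_NCCL=1 makes the build link against the NCCL installed in WSL2
# instead of the copy bundled with the PyTorch sources
USE_SYSTEM_NCCL=1 python setup.py install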

@cascgu

cascgu commented Oct 31, 2021

@AddyLaddy Thank you very much. I'll try to recompile PyTorch.

@Chan0081

> @AddyLaddy Thank you very much. I'll try to recompile PyTorch.

Hi, I've run into the same issue recently. Did recompiling PyTorch work?
