
[DistDGL] Dataloader throws error when sampler is not 0 for torch versions > 1.12 #5731

Open
isratnisa opened this issue May 23, 2023 · 7 comments

Comments

@isratnisa
Collaborator

🐛 Bug

The dataloader throws an error when the number of samplers is greater than 0 in distributed training with PyTorch versions > 1.12. The same script runs fine with PyTorch 1.12.

start training: elapsed time: 5.125, mem (curr: 2.251, peak: 2.251, shared: 0.562,                     global curr: 11.899, global shared: 72.445) GB
/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py:1772: UserWarning: You passed find_unused_parameters=true to DistributedDataParallel, `_set_static_graph` will detect unused parameters automatically, so you do not need to set find_unused_parameters=true, just be sure these unused parameters will not change during training loop while calling `_set_static_graph`.
  warnings.warn(
Client [160] waits on 172.31.28.52:52675
Machine (0) group (0) client (13) connect to server successfuly!
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/dgl/distributed/dist_context.py", line 101, in init_process
    collate_fn_dict[dataloader_name](collate_args),
KeyError: 'dataloader-0'
Process SpawnProcess-2:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/dist-packages/dgl/distributed/dist_context.py", line 114, in init_process
    raise e
  File "/usr/local/lib/python3.8/dist-packages/dgl/distributed/dist_context.py", line 101, in init_process
    collate_fn_dict[dataloader_name](collate_args),
KeyError: 'dataloader-0'
Client [156] waits on 172.31.28.52:39043

Will add more details.

To Reproduce

Steps to reproduce the behavior:

  1. Launch distributed training with num_samplers > 0 (a hedged sketch of a typical setup follows below).
    (Will add more details)
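
For reference, a minimal sketch of the kind of DistDGL setup that hits this. The graph name, partition config path, and fan-outs below are placeholder assumptions, not details from this report:

```python
# Hypothetical minimal setup; "ogbn-products" and the fan-outs are
# placeholders. The key point is num_workers > 0 (i.e., num_samplers > 0).
import dgl
import torch as th

dgl.distributed.initialize("ip_config.txt")
th.distributed.init_process_group(backend="gloo")

g = dgl.distributed.DistGraph("ogbn-products", part_config="ogbn-products.json")
train_nids = dgl.distributed.node_split(
    g.ndata["train_mask"], g.get_partition_book()
)

sampler = dgl.dataloading.NeighborSampler([10, 25])
# num_workers > 0 spawns sampler subprocesses; per this report, that is
# what triggers the KeyError on PyTorch > 1.12.
dataloader = dgl.dataloading.DistNodeDataLoader(
    g, train_nids, sampler, batch_size=1024, shuffle=True, num_workers=1
)

for input_nodes, seeds, blocks in dataloader:
    pass  # forward/backward as usual
```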

Expected behavior

Environment

  • DGL Version (e.g., 1.0): 1.0.0
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 1.13 or PyTorch 2.0
  • OS (e.g., Linux):
  • How you installed DGL (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version (if applicable): 11.6
  • GPU models and configuration (e.g. V100): T4
  • Any other relevant information:

Additional context

@Rhett-Ying
Collaborator

Could you add more details about how to reproduce this issue? Could you share the key part of the DistDataLoader setup?

@Rhett-Ying
Collaborator

Does this issue happen even with num_samplers=1?

@Rhett-Ying
Collaborator

While reproducing this issue, I hit another known issue: #5528 (comment)

@isratnisa
Collaborator Author

isratnisa commented May 25, 2023

@Rhett-Ying I reproduced the issue on GraphStorm: awslabs/graphstorm#199

@chang-l
Collaborator

chang-l commented May 31, 2023

Please check whether this is a duplicate of #5480, caused by a bug in PyTorch's ForkingPickler (the stack trace may vary due to file/data races).
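
One hedged way to check this (assuming the setup sketched earlier in this thread; this snippet is not from the original report): compare num_workers=0 against num_workers>0. If only the multi-worker run fails on PyTorch > 1.12, that is consistent with a worker-process serialization bug like the ForkingPickler issue in #5480.

```python
# Hypothetical A/B diagnostic, reusing g, train_nids, and sampler from the
# sketch above. A failure only at num_workers > 0 points at the sampler
# subprocess path rather than at the sampling logic itself.
for workers in (0, 1):
    loader = dgl.dataloading.DistNodeDataLoader(
        g, train_nids, sampler, batch_size=1024, num_workers=workers
    )
    try:
        next(iter(loader))
        print(f"num_workers={workers}: OK")
    except Exception as exc:
        print(f"num_workers={workers}: failed with {exc!r}")
```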

@github-actions

github-actions bot commented Jul 1, 2023

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you

@github-actions

github-actions bot commented Aug 5, 2023

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you
