
[DistDGL] Dataloader throws error when sampler is not 0 for torch versions > 1.12 #5731

Open
isratnisa opened this issue May 23, 2023 · 7 comments

Comments

@isratnisa
Collaborator

🐛 Bug

The dataloader throws an error when the number of samplers is greater than 0 in distributed training with PyTorch versions > 1.12. The same script runs fine with PyTorch 1.12.

start training: elapsed time: 5.125, mem (curr: 2.251, peak: 2.251, shared: 0.562,                     global curr: 11.899, global shared: 72.445) GB
/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py:1772: UserWarning: You passed find_unused_parameters=true to DistributedDataParallel, `_set_static_graph` will detect unused parameters automatically, so you do not need to set find_unused_parameters=true, just be sure these unused parameters will not change during training loop while calling `_set_static_graph`.
  warnings.warn(
Client [160] waits on 172.31.28.52:52675
Machine (0) group (0) client (13) connect to server successfuly!
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/dgl/distributed/dist_context.py", line 101, in init_process
    collate_fn_dict[dataloader_name](collate_args),
KeyError: 'dataloader-0'
Process SpawnProcess-2:
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/dist-packages/dgl/distributed/dist_context.py", line 114, in init_process
    raise e
  File "/usr/local/lib/python3.8/dist-packages/dgl/distributed/dist_context.py", line 101, in init_process
    collate_fn_dict[dataloader_name](collate_args),
KeyError: 'dataloader-0'
Client [156] waits on 172.31.28.52:39043

Will add more details.

To Reproduce

Steps to reproduce the behavior:

  1. Launch distributed training with num_samplers > 0 (a hedged sketch of a typical setup follows below).
    (Will add more details)
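
For reference, a minimal sketch of the kind of DistDGL setup that hits this. The graph name, partition config path, and fan-outs below are placeholder assumptions, not details from this report:

```python
# Hypothetical minimal setup; "ogbn-products" and the fan-outs are
# placeholders. The key point is num_workers > 0 (i.e., num_samplers > 0).
import dgl
import torch as th

dgl.distributed.initialize("ip_config.txt")
th.distributed.init_process_group(backend="gloo")

g = dgl.distributed.DistGraph("ogbn-products", part_config="ogbn-products.json")
train_nids = dgl.distributed.node_split(
    g.ndata["train_mask"], g.get_partition_book()
)

sampler = dgl.dataloading.NeighborSampler([10, 25])
# num_workers > 0 spawns sampler subprocesses; per this report, that is
# what triggers the KeyError on PyTorch > 1.12.
dataloader = dgl.dataloading.DistNodeDataLoader(
    g, train_nids, sampler, batch_size=1024, shuffle=True, num_workers=1
)

for input_nodes, seeds, blocks in dataloader:
    pass  # forward/backward as usual
```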

Expected behavior

Environment

  • DGL Version (e.g., 1.0): 1.0.0
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 1.13 or PyTorch 2.0
  • OS (e.g., Linux):
  • How you installed DGL (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version (if applicable): 11.6
  • GPU models and configuration (e.g. V100): T4
  • Any other relevant information:

Additional context

@Rhett-Ying
Collaborator

Could you add more details about how to reproduce this issue? Could you share the key part of the DistDataLoader setup?

@Rhett-Ying
Collaborator

Does this issue happen even with num_samplers=1?

@Rhett-Ying
Collaborator

While reproducing this issue, I hit another known issue: #5528 (comment)

@isratnisa
Collaborator Author

isratnisa commented May 25, 2023

@Rhett-Ying I reproduced the issue on GraphStorm: awslabs/graphstorm#199

@chang-l
Collaborator

chang-l commented May 31, 2023

Please check whether this is a duplicate of #5480, caused by a bug in PyTorch's ForkingPickler (the stack trace may vary due to file/data races).
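
One hedged way to check this (assuming the setup sketched earlier in this thread; this snippet is not from the original report): compare num_workers=0 against num_workers>0. If only the multi-worker run fails on PyTorch > 1.12, that is consistent with a worker-process serialization bug like the ForkingPickler issue in #5480.

```python
# Hypothetical A/B diagnostic, reusing g, train_nids, and sampler from the
# sketch above. A failure only at num_workers > 0 points at the sampler
# subprocess path rather than at the sampling logic itself.
for workers in (0, 1):
    loader = dgl.dataloading.DistNodeDataLoader(
        g, train_nids, sampler, batch_size=1024, num_workers=workers
    )
    try:
        next(iter(loader))
        print(f"num_workers={workers}: OK")
    except Exception as exc:
        print(f"num_workers={workers}: failed with {exc!r}")
```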

@github-actions

github-actions bot commented Jul 1, 2023

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you

@github-actions

github-actions bot commented Aug 5, 2023

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you
