# Issues with multiple samplers on torch 1.13 #199
## Comments
Related issues: dmlc/dgl#5480, dmlc/dgl#5528
For bug 1, I think it's worth trying the suggestion:

> Hi @isratnisa, to verify whether this is the same issue as dmlc/dgl#5480, can you please try reverting the problematic commit (pytorch/pytorch@b25a1ce) or rebuilding PyTorch from top-of-tree (TOT) to see if it works?
Hi, I am facing the same error as #199 (comment). I have tried both.

Run command:

Error:
Hi @isratnisa, I wonder if you are using Docker when getting this error? I tried without Docker using

In fact, I also tried
I get the following numbers by checking the open-file limits. This does not seem to be the root cause, given that those numbers are quite large:

Besides, I have also tried
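For reference, a minimal sketch of how the open-file limits can be inspected, and the soft limit raised, from inside the training process using Python's standard `resource` module (the specific numbers reported above are not preserved in this thread):

```python
import resource

# Query the current soft/hard limits on open file descriptors for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# The soft limit can be raised up to the hard limit without extra privileges,
# which can matter when many sampler workers each hold file descriptors.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```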
This seems like a deadlock when saving models, though, not really a multi-sampler issue.
Torch 2.0.1 resolves the issue. Verified with
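For reference, a quick way to confirm which PyTorch and DGL builds the training environment actually picks up before re-running the launch command (a minimal sketch; the exact verification setup used above is not preserved in this thread):

```python
import dgl
import torch

# Report the versions visible to the Python environment used for training.
print("torch:", torch.__version__)
print("dgl:", dgl.__version__)
print("CUDA available:", torch.cuda.is_available())
```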
…377)

Resolves issue #199

Updating the torch version from `torch==1.13` to `torch==2.1.0` in the docker file. Torch versions later than `1.12` had a bug that did not allow us to use `num_samplers` > 0. The bug is resolved in the PyTorch 2.1.0 release. We have verified the solution through the following experiments.

#### Experiment setup:

Dataset: ogbn-mag (partitioned into 2)
DGL versions: '1.0.4+cu117' and '1.1.1+cu113'
Torch version: '2.1.0+cu118'

### Experiment 1: 1 trainer and 4 samplers

```
python3 -u /dgl/tools/launch.py --workspace /graph-storm/python/graphstorm/run/gsgnn_lp --num_trainers 1 --num_servers 1 --num_samplers 4 --part_config /data/ogbn_mag_lp_2p/ogbn-mag.json --ip_config /data/ip_list_p2.txt --ssh_port 2222 --graph_format csc,coo "python3 /graph-storm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf /data/mag_2p_lp.yaml --node-feat-name paper:feat --no-validation true"
```

Output:

```
Epoch 00000 | Batch 000 | Train Loss: 13.5191 | Time: 3.2363
Epoch 00000 | Batch 020 | Train Loss: 3.2547 | Time: 0.4499
Epoch 00000 | Batch 040 | Train Loss: 2.0744 | Time: 0.5477
Epoch 00000 | Batch 060 | Train Loss: 1.6599 | Time: 0.5524
Epoch 00000 | Batch 080 | Train Loss: 1.4543 | Time: 0.4597
Epoch 00000 | Batch 100 | Train Loss: 1.2397 | Time: 0.4665
Epoch 00000 | Batch 120 | Train Loss: 1.0915 | Time: 0.4823
Epoch 00000 | Batch 140 | Train Loss: 0.9683 | Time: 0.4576
Epoch 00000 | Batch 160 | Train Loss: 0.8798 | Time: 0.5382
Epoch 00000 | Batch 180 | Train Loss: 0.7762 | Time: 0.5681
Epoch 00000 | Batch 200 | Train Loss: 0.7021 | Time: 0.4492
Epoch 00000 | Batch 220 | Train Loss: 0.6619 | Time: 0.4450
Epoch 00000 | Batch 240 | Train Loss: 0.6001 | Time: 0.4437
Epoch 00000 | Batch 260 | Train Loss: 0.5591 | Time: 0.4540
Epoch 00000 | Batch 280 | Train Loss: 0.5115 | Time: 0.3577
Epoch 0 take 134.6200098991394
```

### Experiment 2: 4 trainers and 4 samplers

```
python3 -u /dgl/tools/launch.py --workspace /graph-storm/python/graphstorm/run/gsgnn_lp --num_trainers 4 --num_servers 1 --num_samplers 4 --part_config /data/ogbn_mag_lp_2p/ogbn-mag.json --ip_config /data/ip_list_p2.txt --ssh_port 2222 --graph_format csc,coo "python3 /graph-storm/python/graphstorm/run/gsgnn_lp/gsgnn_lp.py --cf /data/mag_2p_lp.yaml --node-feat-name paper:feat --no-validation true"
```

Output:

```
Epoch 00000 | Batch 000 | Train Loss: 11.1130 | Time: 4.6957
Epoch 00000 | Batch 020 | Train Loss: 3.3098 | Time: 0.7897
Epoch 00000 | Batch 040 | Train Loss: 1.9996 | Time: 0.8633
Epoch 00000 | Batch 060 | Train Loss: 1.5202 | Time: 0.4229
Epoch 0 take 56.44491267204285
successfully save the model to /data/ogbn-map-lp/model/epoch-0
Time on save model 5.461951017379761
```

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
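Since the fix is effectively a version pin, one defensive option is to check the PyTorch version before enabling multiple samplers. The helper below is a hypothetical sketch, not part of GraphStorm; the `2.1.0` threshold follows the Dockerfile update described above:

```python
import warnings

import torch


def check_sampler_support(num_samplers: int) -> None:
    """Warn when num_samplers > 0 is requested on a PyTorch build known to
    break DGL's distributed samplers (see issue #199).

    Hypothetical helper; the 2.1.0 threshold follows the PR above.
    """
    # torch.__version__ may carry a local suffix such as "+cu118"; strip it
    # before comparing the major/minor components.
    major, minor = (int(x) for x in torch.__version__.split("+")[0].split(".")[:2])
    if num_samplers > 0 and (1, 12) < (major, minor) < (2, 1):
        warnings.warn(
            f"num_samplers={num_samplers} with torch {torch.__version__}: "
            "torch builds after 1.12 and before 2.1 hit a known multi-sampler "
            "bug; consider upgrading to torch>=2.1.0 or using num_samplers=0."
        )


check_sampler_support(num_samplers=4)
```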
## 🐛 Bug
The training script for link prediction does not work with multiple samplers on PyTorch 1.13. So far, three different bugs were found. In summary:
- `KeyError: 'dataloader-0'` error from `dgl/distributed/dist_context.py`
Note:
Details
Bug 1:
Run command:
Error:
Bug 2:
Run command:
Error:
Bug 3:
Error:
Environment