
Question about batch_size? #2

Open
jiangchaokang opened this issue Jul 17, 2022 · 1 comment

@jiangchaokang

Good job! Excellent work!

When I train the model with batch_size=1 on a single 3090 GPU, training works perfectly.
But setting batch_size to 2 raises the following error:

return torch.stack(batch, 0, out=out)
RuntimeError: stack expects each tensor to be equal size, but got [15, 2182] at entry 0 and [15, 5269] at entry 1
RuntimeError: Caught RuntimeError in DataLoader worker process 0.

With multiple workers enabled (num_workers != 0), the traceback points at whichever worker failed: the samples being merged into a batch have different sizes, so the first worker (worker process 0) crashes during collation.
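
For reference, a standalone snippet (using the shapes from the traceback above) that reproduces the same failure with PyTorch's default collate function:

```python
import torch
from torch.utils.data.dataloader import default_collate

# Two samples with different point counts, mimicking the shapes
# from the traceback above ([15, 2182] vs. [15, 5269]).
a = torch.zeros(15, 2182)
b = torch.zeros(15, 5269)

# default_collate calls torch.stack(batch, 0), which requires every
# tensor in the batch to have the same shape, so any batch_size > 1
# fails as soon as two samples have different numbers of points.
try:
    default_collate([a, b])
except RuntimeError as e:
    print(e)  # stack expects each tensor to be equal size, ...
```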

Are you able to run with a larger batch_size on your side? Why am I hitting this problem with the code in your repository?
The command I ran is:
python -m torch.distributed.launch --nproc_per_node=1 --master_port 1234 main_DCAdapt.py configs/train_adapt_ft3d_kitti.yaml

@JZ-9962
Collaborator

JZ-9962 commented Jul 18, 2022

Hi, thanks for your interest in our work!

Because we use HPLFlowNet as the backbone, and its network inputs have variable sizes, we adopt distributed data parallel to support multi-GPU training (with a batch_size of 1 per GPU). For details, see laoreja/HPLFlowNet#11
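
For anyone hitting this, here is a minimal sketch of that pattern. It is not the repository's actual training code: the toy dataset and model are stand-ins, but it shows how each process launched by torch.distributed.launch drives one GPU with batch_size=1, so variable-sized samples are never stacked.

```python
import argparse
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, Dataset, DistributedSampler

class ToyVariableDataset(Dataset):
    """Stand-in for the real dataset: samples have different point counts."""
    def __len__(self):
        return 8

    def __getitem__(self, i):
        return torch.randn(15, 2000 + 500 * (i % 3))

def main():
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank to each process.
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(args.local_rank)

    dataset = ToyVariableDataset()
    sampler = DistributedSampler(dataset)  # one shard of the data per process
    # batch_size=1 per GPU: samples are never stacked, so unequal point
    # counts cannot trigger the "stack expects each tensor..." error.
    loader = DataLoader(dataset, batch_size=1, sampler=sampler, num_workers=2)

    model = DDP(nn.Conv1d(15, 15, 1).cuda(args.local_rank),
                device_ids=[args.local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    for sample in loader:
        out = model(sample.cuda(args.local_rank))  # shape [1, 15, n]
        out.mean().backward()  # DDP averages gradients across processes
        opt.step()
        opt.zero_grad()

if __name__ == "__main__":
    main()
```

Launched, e.g., with `python -m torch.distributed.launch --nproc_per_node=2 toy_ddp.py` (the filename is hypothetical), each of the two processes trains on its own GPU, giving an effective global batch size of 2.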
