
Problem about "assert (self.history_seq_ids != seq_ids)[~start_of_sequence].sum() == 0" during training #46

Open
polyethylene16 opened this issue Oct 31, 2024 · 2 comments

Comments


polyethylene16 commented Oct 31, 2024

Thank you very much for your excellent work! However, I have run into a frustrating problem while trying to reproduce it.

  • Background: I replaced the ResNet50 backbone in fbocc-r50-cbgs_depth_16f_16x4_20e.py with ConvNeXtV2-Base, leaving all other parameters unchanged (a rough sketch of the override is at the end of this comment). I then followed the example in start.md and ran ./tools/dist_train.sh ./occupancy_configs/fb_occ/fbocc-r50-cbgs_depth_16f_16x4_20e.py 2 to train on two GPUs.
  • Problem: Training ran fine for two epochs, but during the third epoch the error below was raised. Although it does not appear to be specific to FB-OCC, I would appreciate your help.
  • Detailed Error Information:
...
2024-10-30 19:18:05,512 - mmdet - INFO - Iter [4000/39980]   lr:2.000e-04, eta:19:56:44,  time:  2.128,  data_time:  0.016,  memory:  26404,  loss_voxel_ce_c_0:  1.1668,  loss_voxel_sem_scal_c_0:  6.0262,  loss_voxel_geo_scal_c_0:  1.1700,  loss_voxel_lovasz_c_0:  0.8033,  loss_depth:  4.5904,  loss:  13.7567,  grad_norm:  438604
Traceback (most recent call last):
  File "./tools/train.py", line 373, in <module>
     main()
  File " ./tools/train.py", line 362, in main
     train_model(
  File "/path/to/my/workspace/mmdetection3d/mmdet3d/apis/train.py", line 28, in train_model
     train_detector(
  File "/opt/conda/envs/mtbev/lib/python 3.8/site-packages/mmdet/apis/train.py", line 170, in train_detector
     runner.run(data_loaders, cfg.workflow)
  File "/opt/conda/envs/mtbev/lib/python 3.8/site-packages/mmcv/runner/iter_based_runner.py", line 134, in run
     iter_runner(iter_loaders[i], **kwards)
  File "/opt/conda/envs/mtbev/lib/python 3.8/site-packages/mmcv/runner/iter_based_runner.py", line 61, in train
     outputs = self.model.train_step(data_batch, self.optimizer, **kwards)
  File "/opt/conda/envs/mtbev/lib/python 3.8/site-packages/mmcv/parallel/distributed.py", line 52, in train_step
     output = self.module.train_step(*inputs[0], **kwards[0])
  File "/opt/conda/envs/mtbev/lib/python 3.8/site-packages/mmdet/models/detectors/base.py", in line 237, in train_step
     losses = self(**data)
  File "/opt/conda/envs/mtbev/lib/python 3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
     return forward_call(*input, **kwards)
  File "/opt/conda/envs/mtbev/lib/python 3.8/site-packages/mmcv/runner/fp_utils.py", line 128, in new_func
     output = old_func(*new_args, **new_kwards)
  File "/path/to/my/workspace/mmdetion3d/mmdet3d/models/detectors/base.py", line 59, in forward
     return self.forward_train(**kwards)
  File "/path/to/my/workspace/projects/mmdet3d_plugin/occ/detectors/fbocc.py", line 430, in forward_train
     results= self.extract_feat(
  File "/path/to/my/workspace/projects/mmdet3d_plugin/occ/detectors/fbocc.py", line 384, in extract_feat
     results.update(self.extract_img_bev_feat(img, img_metas, **kwards))
  File "/path/to/my/workspace/projects/mmdet3d_plugin/occ/detectors/fbocc.py", line 362, in extract_img_bev_feat
     bev_feat = self.fuse_history(bev_feat, img_metas, img[6])
  File "opt/conda/evs/mtbev/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 214, in new_func
     output = old_func(*new_args, **new_kwards)
  File "/path/to/my/workspace/projects/mmdet3d_plugin/occ/detectors/fbocc.py", line 207, in fuse_history
     assert (self.history_seq_ids != seq_ids)[~start_of_sequence].sum() == 0, \
AssertionError: tensor([555, 555, 555, 555], device='cuda:1'), tensor([965, 965, 965, 965], device='cuda:1'), tensor([False, False, False, False], device='cuda:1')
ERROR:torch.distributed.elastic. multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 414487) of binary: /opt/conda/envs/mtbev/bin/python
...
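To make the failure easier to read, here is a minimal standalone reconstruction of the check in fuse_history (my own sketch, not the repository's code), filled with the tensor values printed in the AssertionError on rank 1:

import torch

# Values taken from the AssertionError message above (device='cuda:1').
history_seq_ids = torch.tensor([555, 555, 555, 555])      # sequence ids cached from the previous iteration
seq_ids = torch.tensor([965, 965, 965, 965])               # sequence ids of the current batch
start_of_sequence = torch.tensor([False, False, False, False])

# The check tolerates an id mismatch only where a new sequence is flagged.
# Here every sample mismatches but none is marked as a sequence start, so
# the assertion fires exactly as in the traceback.
mismatch = (history_seq_ids != seq_ids)[~start_of_sequence]
assert mismatch.sum() == 0, (history_seq_ids, seq_ids, start_of_sequence)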

It's worth mentioning that I used your data prep tool exclusively to prepare the data.
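For reference, the backbone override in my config looked roughly like the sketch below. The registry type name, arch string, and out_indices are written from memory and depend on the mmcls/mmcv version installed, so please treat them as assumptions rather than exact values; the point is only that everything except img_backbone was inherited unchanged from fbocc-r50-cbgs_depth_16f_16x4_20e.py.

# Rough sketch of my override config (names below are assumptions, not exact values).
_base_ = ['./fbocc-r50-cbgs_depth_16f_16x4_20e.py']

model = dict(
    img_backbone=dict(
        _delete_=True,            # drop the ResNet50 settings inherited from the base config
        type='ConvNeXtV2',        # assumed registry name for the ConvNeXtV2 implementation I plugged in
        arch='base',
        out_indices=(2, 3),       # assumed; kept consistent with what the image neck expects
        # ... remaining ConvNeXtV2-Base arguments (drop path rate, pretrained weights, etc.)
    ),
)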


KUGDXL commented Dec 29, 2024

I've also encountered a similar problem. Have you resolved it? Thanks.

polyethylene16 (Author) commented

> I've also encountered a similar problem. Have you resolved it? Thanks.

No, I still haven't figured it out. Unfortunately, I'm no longer working on the occupancy project, but I still look forward to replies from the author or other practitioners.
