RuntimeError: CUDA error: an illegal memory access was encountered #6

Open

wsonia opened this issue Dec 8, 2021 · 4 comments

wsonia commented Dec 8, 2021

I deployed the same environment and used the public cardiac data to run the code, but I got this error while training:
Validation sanity check: 0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 137, in
trainer.fit(net, datamodule=data_module)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 737, in fit
self._call_and_handle_interrupt(
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 682, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 772, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1195, in _run
self._dispatch()
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1274, in _dispatch
self.training_type_plugin.start_training(self)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
self._results = trainer.run_stage()
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1284, in run_stage
return self._run_train()
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1306, in _run_train
self._run_sanity_check(self.lightning_module)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1370, in _run_sanity_check
self._evaluation_loop.run()
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 109, in advance
dl_outputs = self.epoch_loop.run(dataloader, dataloader_idx, dl_max_batches, self.num_dataloaders)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 122, in advance
output = self._evaluation_step(batch, batch_idx, dataloader_idx)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 217, in _evaluation_step
output = self.trainer.accelerator.validation_step(step_kwargs)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 236, in validation_step
return self.training_type_plugin.validation_step(*step_kwargs.values())
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 444, in validation_step
return self.model(*args, **kwargs)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 619, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/pytorch_lightning/overrides/base.py", line 92, in forward
output = self.module.validation_step(*inputs, **kwargs)
File "/3D-UCaps-main/module/ucaps.py", line 265, in validation_step
val_outputs = sliding_window_inference(
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/monai/inferers/utils.py", line 130, in sliding_window_inference
seg_prob = predictor(window_data, *args, **kwargs).to(device) # batched patch segmentation
File "/3D-UCaps-main/module/ucaps.py", line 171, in forward
x = self.feature_extractor(x)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
input = module(input)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
input = module(input)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/anaconda3/envs/UCaps/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 572, in forward
return F.conv3d(input, self.weight, self.bias, self.stride,
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL error in: /opt/conda/conda-bld/pytorch_1607370172916/work/torch/lib/c10d/../c10d/NCCLUtils.hpp:136, unhandled cuda error, NCCL version 2.7.8
./train_ucaps_cardiac.sh: line 25: 171684 Aborted (core dumped) python train.py --log_dir ./3D-UCaps-main/logs_heart --gpus 1 --accelerator ddp --check_val_every_n_epoch 5 --max_epochs 100 --dataset task02_heart --model_name ucaps --root_dir ./3D-UCaps-main/Task02_Heart --fold 0 --cache_rate 1.0 --train_patch_size 128 128 128 --num_workers 64 --batch_size 1 --share_weight 0 --num_samples 1 --in_channels 1 --out_channels 2 --val_patch_size(UCaps

Contributor

hoangtan96dl commented Dec 11, 2021

Hello @wentj897, can I know which device you are using? In my experience this can be caused by an out-of-memory error; you can reduce the memory requirement by:

  • Reducing batch_size, num_samples, and train_patch_size for the training step
  • Reducing sw_batch_size and val_patch_size for the validation step

Since the error occurs in the validation step, I think you should reduce val_patch_size to a smaller size such as 64x64x64 that fits on your device. Additionally, the training step uses even more memory, so I think you should reduce those parameters as well.
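
For example, a minimal sketch of the command from the report above with smaller patches for both steps (every flag is copied from that command except --sw_batch_size, which is assumed to be exposed by train.py, so check the script's arguments before relying on it):

# Minimal sketch: same flags as the reported command, with 64x64x64 patches.
# --sw_batch_size (windows per sliding-window batch) is an assumed flag name.
python train.py \
  --log_dir ./3D-UCaps-main/logs_heart \
  --gpus 1 \
  --accelerator ddp \
  --dataset task02_heart \
  --model_name ucaps \
  --root_dir ./3D-UCaps-main/Task02_Heart \
  --fold 0 \
  --cache_rate 1.0 \
  --in_channels 1 \
  --out_channels 2 \
  --batch_size 1 \
  --num_samples 1 \
  --train_patch_size 64 64 64 \
  --val_patch_size 64 64 64 \
  --sw_batch_size 1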

@kingjames1155

Thank you very much for your open-source code. I also encountered this problem in the training stage. How did you solve it? My GPU is a 3090 with CUDA 11.3. I've tried reducing batch_size, num_samples, and train_patch_size, but it did not work.

@kingjames1155

path/to/luna/
  imgs
  segs

Are the files extracted from subset0-subset9 stored in the imgs folder?
Are the files extracted from seg-lungs-LUNA16 stored in the segs folder?
Do they need any other preprocessing?

@hoangtan96dl
Contributor

Hello @kingjames1155, sorry for my late reply.
For your first question: if you have the same problem as @wentj897, you are actually getting the error at the validation step, not the training step. Hence, try reducing val_patch_size and sw_batch_size first to see if that solves the problem.
For your second question, the answer is yes: you just need to extract the LUNA16 dataset as it is. I also point out some erroneous files in the README that you need to remove.
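
For reference, a rough sketch of preparing that folder layout, assuming the standard LUNA16 archive names subset0.zip through subset9.zip and seg-lungs-LUNA16.zip (the paths are placeholders):

# Rough sketch only: archive names are assumed; folder names follow the
# imgs/segs layout quoted above.
mkdir -p path/to/luna/imgs path/to/luna/segs
for i in $(seq 0 9); do
  unzip "subset${i}.zip" -d path/to/luna/imgs
done
unzip seg-lungs-LUNA16.zip -d path/to/luna/segs

Whether the extracted files may keep their subset subfolders or must sit directly in imgs/ is not spelled out in this thread, so double-check the data preparation section of the README, which also lists the files that need to be removed.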
