Training process of Mask R-CNN crashes when workers parameter is greater than 0 #67

Open
cywinski opened this issue Feb 2, 2023 · 2 comments

Comments

@cywinski
cywinski commented Feb 2, 2023

Hi,

I am trying to train the Mask R-CNN object detection model with EfficientNet B3 as the backbone on my custom dataset. I train the model according to the configuration provided here, on 2 GPUs with a batch size of 4 per GPU. I start training with the command shown here.

When I run training with the workers parameter set to a value greater than 0, I get the following error after a couple of epochs/steps, and training then crashes:

Traceback (most recent call last):
  File "/opt/venv/bin/cvnets-train", line 33, in <module>
    sys.exit(load_entry_point('cvnets', 'console_scripts', 'cvnets-train')())
  File "/home/ir/pvc/ml-cvnets/main_train.py", line 234, in main_worker
    torch.multiprocessing.spawn(
  File "/opt/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/opt/venv/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGKILL

/usr/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 28 leaked semaphore objects to clean up at shutdown

However, when I set workers to 0, training seems to work properly but very slowly: I was able to train the model for 23 epochs, and each epoch took over 3 hours.

Do you know what might be the cause of this problem?
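
A SIGKILL that only shows up with worker processes is often, though not always, sent by the kernel's out-of-memory killer, or follows from exhausted shared memory. Nothing in this thread confirms either cause, so the following is only a diagnostic sketch for a Linux machine. It reads available RAM from /proc/meminfo and free space on /dev/shm. The helper names `parse_meminfo` and `shm_free_bytes` are made up for this example and are not part of ml-cvnets or PyTorch.

```python
import os
import shutil


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def shm_free_bytes(path="/dev/shm"):
    """Return free bytes on the shared-memory mount, or None if it is absent."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).free


if __name__ == "__main__":
    # Assumption: a Linux host where /proc/meminfo and /dev/shm exist.
    if os.path.exists("/proc/meminfo"):
        with open("/proc/meminfo") as f:
            mem = parse_meminfo(f.read())
        print("MemAvailable (MiB):", mem.get("MemAvailable", 0) // 1024)
    free = shm_free_bytes()
    if free is not None:
        print("/dev/shm free (MiB):", free // (1024 * 1024))
```

Running this periodically while training with workers greater than 0 would show whether available memory or shared memory shrinks toward zero before the crash.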

@tuobaye11
How do I use ml-cvnets?
When I run the following commands, I get a "Command not found" error:

export CFG_FILE="config/classification/imagenet/resnet.yaml"
cvnets-train --common.config-file $CFG_FILE --common.results-loc classification_results

main_train: Command not found
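
A "Command not found" message usually means the shell cannot find the executable on PATH. The traceback earlier in this thread shows `cvnets-train` installed as a console script inside a virtual environment, so one possible check, assuming the package was installed into the currently active environment, is to ask Python where the command resolves. The helper name `find_command` is made up for this sketch.

```python
import shutil


def find_command(name):
    """Return the full path of an executable found on PATH, or None if missing."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_command("cvnets-train")
    if path is None:
        # Assumption: the package is not installed in this environment,
        # or the environment that contains it is not activated.
        print("cvnets-train is not on PATH")
    else:
        print("cvnets-train found at", path)
```

If the command is not found, activating the environment where the package was installed (or reinstalling it there) is the likely next step.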

@farzadab
Collaborator

farzadab commented Apr 7, 2023 via email
