
The multiprocessing errors may be related to the dataset "syntext1_96voc" #48

Open

YuMJie opened this issue Oct 11, 2023 · 2 comments
YuMJie commented Oct 11, 2023

When I run python tools/train_net.py --config-file configs/R_50/CTW1500/finetune_96voc_50maxlen.yaml --num-gpus 4,
some errors occur. However, it runs correctly if I set --num-gpus 1 or change the config to

DATASETS:
  TRAIN: ("ic13_train_96voc","totaltext_train_96voc")
  TEST: ("ctw1500_test",)

in the config file configs/R_50/CTW1500/pretrain_96voc_50maxlen.yaml.
The error occurs when I set TRAIN: ("syntext1_96voc","ic13_train_96voc","totaltext_train_96voc")

[10/11 08:19:10 adet.data.dataset_mapper]: Cropping used in training: RandomCropWithInstance(crop_type='relative_range', crop_size=[0.1, 0.1], crop_instance=False)
[10/11 08:19:11 adet.data.datasets.text]: Loaded 229 images in COCO format from /dataset/ic13/train_96voc.json
[10/11 08:19:46 adet.data.datasets.text]: Loading /dataset/syntext1/annotations/train_96voc.json takes 35.33 seconds.
[10/11 08:19:47 adet.data.datasets.text]: Loaded 94723 images in COCO format from /dataset/syntext1/annotations/train_96voc.json
[10/11 08:24:02 d2.data.build]: Removed 0 images with no usable annotations. 94950 images left.
[10/11 08:24:02 d2.data.build]: Using training sampler TrainingSampler
[10/11 08:24:03 d2.data.common]: Serializing 94950 elements to byte tensors and concatenating them all ...
Traceback (most recent call last):
  File "train_net.py", line 304, in <module>
    launch(
  File "/usr/local/lib/python3.8/dist-packages/detectron2/engine/launch.py", line 67, in launch
    mp.spawn(
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 130, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL
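A SIGKILL at the "Serializing 94950 elements to byte tensors" step usually means the Linux out-of-memory killer terminated the process: detectron2 pickles every dataset dict into one byte buffer, and with --num-gpus 4 each spawned process can end up building its own copy. A rough stdlib-only sketch of that cost (the record layout and the 4x scaling factor are illustrative assumptions, not measurements from this setup):

```python
import pickle

def estimate_serialized_size(dataset_dicts):
    """Mimic detectron2's d2.data.common serialization: pickle each
    record to bytes and sum the sizes. This approximates the extra
    memory a worker process needs to hold the serialized dataset."""
    total = 0
    for record in dataset_dicts:
        total += len(pickle.dumps(record, protocol=-1))
    return total

# Hypothetical records standing in for COCO-format text annotations:
fake_records = [
    {
        "file_name": f"img_{i}.jpg",
        "annotations": [{"bbox": [0, 0, 10, 10], "text": "abc"}
                        for _ in range(20)],
    }
    for i in range(1000)
]

per_record = estimate_serialized_size(fake_records) / len(fake_records)
# Scale to the ~95k images in the log, times 4 spawned processes:
print(f"~{per_record * 94950 * 4 / 1e9:.2f} GB across 4 GPU processes")
```

Dense synthetic sets like syntext1 have far more (and larger) annotations per image than ic13 or totaltext, which would explain why only those datasets trigger the kill.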

The directory tree of my dataset is as follows:

.
├── ArT
│   ├── art_train.json
│   └── rename_artimg_train
├── CTW1500
│   ├── test.json
│   ├── test_images
│   ├── train_96voc.json
│   ├── train_images
│   ├── weak_voc_new.txt
│   └── weak_voc_pair_list.txt
├── ChnSyntext
│   ├── chn_syntext.json
│   └── syn_130k_images
├── LSVT
│   ├── annotations
│   ├── lsvt_train.json
│   └── rename_lsvtimg_train
├── ReCTS
│   ├── ReCTS_test_images
│   ├── ReCTS_train_images
│   ├── ReCTS_val_images
│   ├── rects_test.json
│   ├── rects_train.json
│   └── rects_val.json
├── evaluation
│   ├── gt_ctw1500.zip
│   ├── gt_icdar2015.zip
│   ├── gt_inversetext.zip
│   └── gt_totaltext.zip
├── ic13
│   ├── train_37voc.json
│   ├── train_96voc.json
│   └── train_images
├── ic15
│   ├── GenericVocabulary.txt
│   ├── GenericVocabulary_new.txt
│   ├── GenericVocabulary_pair_list.txt
│   ├── ch4_test_vocabulary.txt
│   ├── ch4_test_vocabulary_new.txt
│   ├── ch4_test_vocabulary_pair_list.txt
│   ├── ic15_test.json
│   ├── ic15_train.json
│   ├── new_strong_lexicon
│   ├── strong_lexicon
│   ├── test.json
│   ├── test_images
│   ├── train_37voc.json
│   ├── train_96voc.json
│   └── train_images
├── inversetext
│   ├── inversetext_lexicon.txt
│   ├── inversetext_pair_list.txt
│   ├── test.json
│   └── test_images
├── mlt2017
│   ├── train_37voc.json
│   ├── train_96voc.json
│   └── train_images
├── syntext1
│   ├── annotations
│   ├── train.json
│   └── train_images
├── syntext2
│   ├── annotations
│   ├── train.json
│   ├── train_37voc.json
│   ├── train_96voc.json
│   └── train_images
├── textocr
│   ├── train_37voc_1.json
│   ├── train_37voc_2.json
│   └── train_images
└── totaltext
    ├── test.json
    ├── test_images
    ├── train.json
    ├── train_37voc.json
    ├── train_96voc.json
    ├── train_images
    ├── weak_voc_new.txt
    └── weak_voc_pair_list.txt
Collaborator

ymy-k commented Oct 12, 2023

Try to reduce the number of workers? Maybe it's a memory issue?
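If memory is the cause, the worker count is controlled by DATALOADER.NUM_WORKERS in the detectron2-style config; a minimal fragment to merge into the YAML (the value 2 is only an illustrative choice; detectron2's default is 4):

```yaml
DATALOADER:
  NUM_WORKERS: 2
```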

Author

YuMJie commented Oct 12, 2023

Try to reduce the number of workers? Maybe it's a memory issue?

I tried reducing the number of workers, but the error still occurs.
Interestingly, it works if I set

DATASETS:
  TRAIN: ("totaltext_train_96voc",)
  TEST: ("ctw1500_test",)

and I can correctly run

python tools/train_net.py --config-file configs/R_50/pretrain/150k_tt_mlt_13_15.yaml --num-gpus 4
python tools/train_net.py --config-file configs/R_50/TotalText/finetune_150k_tt_mlt_13_15.yaml --num-gpus 4
python tools/train_net.py --config-file configs/R_50/IC15/finetune_150k_tt_mlt_13_15.yaml --num-gpus 4

but it cannot run with the datasets syntext1_96voc and syntext2_96voc.
Thanks for your reply!
