
Slow DBNet++ Training #10204

Closed
prashantkh19 opened this issue Jun 19, 2023 · 2 comments

@prashantkh19

I've been training a DBNet++ model from scratch. I have around 3.5M training images and 17k validation images.
My training config looks like this:

Global:
  debug: false
  use_gpu: true
  epoch_num: 500
  log_smooth_window: 20
  print_batch_step: 10
  save_model_dir: ./output/dbnet_plus_all/
  save_epoch_step: 1
  eval_batch_step: 
   - 0
   - 21000
  cal_metric_during_train: false
  pretrained_model: ../pretrained_models/ResNet50_dcn_asf_synthtext_pretrained
  checkpoints: ./output/dbnet_plus_all/latest
  save_inference_dir: ./inference_dir/
  use_visualdl: true
  infer_img: 
  save_res_path: ../vis_out/dbnet_plus_all/out.txt
Architecture:
  model_type: det
  algorithm: DB++
  Transform: null
  Backbone:
    name: ResNet
    layers: 50
    dcn_stage: [False, True, True, True]
  Neck:
    name: DBFPN
    out_channels: 256
    use_asf: True
  Head:
    name: DBHead
    k: 50
Loss:
  name: DBLoss
  balance_loss: true
  main_loss_type: BCELoss
  alpha: 5
  beta: 10
  ohem_ratio: 3
Optimizer:
  name: Momentum
  momentum: 0.9
  lr:
    name: DecayLearningRate
    learning_rate: 0.007
    epochs: 1000
    factor: 0.9
    end_lr: 0
  weight_decay: 0.0001
PostProcess:
  name: DBPostProcess
  thresh: 0.2
  box_thresh: 0.3
  max_candidates: 1000
  unclip_ratio: 1.8
  det_box_type: 'quad' # 'quad' or 'poly'
Metric:
  name: DetMetric
  main_indicator: hmean
Train:
  dataset:
    name: SimpleDataSet
    data_dir: ../
    label_file_list:
      - ../data_collection_aws/label_files/care_train.txt
      - ../data_collection_aws/label_files/listening_train.txt
      - ../data_collection_aws/label_files/marketing_train.txt
      - ../data_collection_aws/label_files/samsung_train.txt                                             
    ratio_list: [1.0, 1.0, 1.0, 1.0]                                                                                              
    transforms:
    - DecodeImage:
        img_mode: BGR
        channel_first: false
    - DetLabelEncode: null
    - IaaAugment:
        augmenter_args:
        - type: Fliplr
          args:
            p: 0.5
        - type: Affine
          args:
            rotate:
            - -10
            - 10
        - type: Resize
          args:
            size:
            - 0.5
            - 3
    - EastRandomCropData:
        size:
        - 960
        - 960
        max_tries: 10
        keep_ratio: true
    - MakeShrinkMap:
        shrink_ratio: 0.6
        min_text_size: 8
    - MakeBorderMap:
        shrink_ratio: 0.6
        thresh_min: 0.3
        thresh_max: 0.7
    - NormalizeImage:
        scale: 1./255.
        mean:
        - 0.48109378172549
        - 0.45752457890196
        - 0.40787054090196
        std:
        - 1.0
        - 1.0
        - 1.0
        order: hwc
    - ToCHWImage: null
    - KeepKeys:
        keep_keys:
        - image
        - threshold_map
        - threshold_mask
        - shrink_map
        - shrink_mask
  loader:
    shuffle: true
    drop_last: false
    batch_size_per_card: 16
    num_workers: 4
Eval:
  dataset:
    name: SimpleDataSet
    data_dir: ../
    label_file_list:
      - ../data_collection_aws/label_files/test_all.txt 
    transforms:
    - DecodeImage:
        img_mode: BGR
        channel_first: false
    - DetLabelEncode: null
    - DetResizeForTest:
        limit_side_len: 1080
        limit_type: max
    - NormalizeImage:
        scale: 1./255.
        mean:
        - 0.48109378172549
        - 0.45752457890196
        - 0.40787054090196
        std:
        - 1.0
        - 1.0
        - 1.0
        order: hwc
    - ToCHWImage: null
    - KeepKeys:
        keep_keys:
        - image
        - shape
        - polys
        - ignore_tags
  loader:
    shuffle: false
    drop_last: false
    batch_size_per_card: 1
    num_workers: 1
profiler_options: null

I'm using 8 A100 40GB GPUs, and the training ETA is still 17-18 days. Is there anything wrong with the config? How can I speed up the training?
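For context, here is a quick back-of-the-envelope step count based on the numbers in the issue (3.5M images, 8 GPUs, `batch_size_per_card: 16`, `epoch_num: 500`), assuming one optimizer step per global batch:

```python
import math

# Numbers taken from the issue and the config above
num_images = 3_500_000    # training set size
num_gpus = 8              # A100 40GB cards
batch_per_card = 16       # Train.loader.batch_size_per_card
epochs = 500              # Global.epoch_num

global_batch = num_gpus * batch_per_card                # 128 images per step
steps_per_epoch = math.ceil(num_images / global_batch)  # drop_last: false
total_steps = steps_per_epoch * epochs

print(f"{steps_per_epoch} steps/epoch, {total_steps} steps total")
# Doubling batch_size_per_card halves the step count; per-step cost rises,
# but wall-clock time usually drops if the GPUs were underutilized.
```

With these settings the run is roughly 13.7M optimizer steps, which is why the per-step cost (and GPU utilization) dominates the ETA.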

@ToddBear ToddBear added the good first issue Good for newcomers label Jun 30, 2023
@livingbody
Contributor

Turn up your 'batch_size_per_card' so that you make full use of the GPU memory.
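As a concrete sketch of that suggestion (the exact values are assumptions; tune them while watching actual GPU memory usage and dataloader throughput), the `Train.loader` section could become:

```yaml
# Illustrative values only; raise batch_size_per_card until GPU memory is
# nearly full, and add workers if data loading is the bottleneck.
loader:
  shuffle: true
  drop_last: false
  batch_size_per_card: 32   # doubled from 16 in the original config
  num_workers: 8            # doubled from 4 in the original config
```

Note that a larger effective batch size may also warrant rescaling `Optimizer.lr.learning_rate`.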

@shiyutang
Collaborator

The comment above answers the question; feel free to reopen the issue if any follow-up problems come up. We are also currently holding a contribution activity for PaddleSeg and PaddleOCR, and you are welcome to join: #10223
