DBNet mixed-precision training does not converge #12445

Jverson · 2023-09-12T01:19:57Z


System Environment: Ubuntu
Version: PaddleOCR v2.3, v2.5, and v2.7 (all tried)
Related components: cuda: 10.2
paddlepaddle-gpu: 2.4.2
paddle-bfloat: 0.1.7

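Problem description:
With mixed-precision training enabled, DBNet never converges. The loss scale keeps shrinking until it reaches 0.

Training command: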
python3 -m paddle.distributed.launch --gpus '0,1,2,3,4,5,6,7' tools/train.py -c configs/det/det_mv3_db_amp.yml

Config file:
Global:
  use_gpu: true
  use_xpu: false
  use_mlu: false
  epoch_num: 200
  log_smooth_window: 20
  print_batch_step: 10
  save_model_dir: /data/huajie/ocr_project/ppocr_output_det/det_dbnet_230912_db_v4_amp
  save_epoch_step: 200
  eval_batch_step: [10000, 2000]
  cal_metric_during_train: False
  pretrained_model: /data/huajie/ocr_project/ppocr_output_det/pre_train_model/MobileNetV3_large_x0_5_pretrained
  checkpoints:
  save_inference_dir:
  use_visualdl: False
  infer_img: doc/imgs_en/img_10.jpg
  save_res_path: /data/huajie/ocr_project/ppocr_output_det/det_dbnet_230912_db_v4_amp/predicts_db.txt
  use_amp: true
  scale_loss: 1024.0
  use_dynamic_loss_scaling: true

Architecture:
  model_type: det
  algorithm: DB
  Transform:
  Backbone:
    name: MobileNetV3
    scale: 0.5
    model_name: large
  Neck:
    name: DBFPN
    out_channels: 256
  Head:
    name: DBHead
    k: 50

Loss:
  name: DBLoss
  balance_loss: true
  main_loss_type: DiceLoss
  alpha: 5
  beta: 10
  ohem_ratio: 3

Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  lr:
    learning_rate: 0.001
  regularizer:
    name: 'L2'
    factor: 0

PostProcess:
  name: DBPostProcess
  thresh: 0.3
  box_thresh: 0.6
  max_candidates: 1000
  unclip_ratio: 1.5

Metric:
  name: DetMetric
  main_indicator: hmean

Train:
  dataset:
    name: SimpleDataSet
    data_dir: /data/huajie/ocr_data/scanpen_det/
    label_file_list:
      - /data/huajie/ocr_data/dbnet_230720/det_label_train.txt
    ratio_list: [1.0]
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - DetLabelEncode: # Class handling label
      - IaaAugment:
          augmenter_args:
            - { 'type': Fliplr, 'args': { 'p': 0.5 } }
            - { 'type': Affine, 'args': { 'rotate': [-10, 10] } }
            - { 'type': Resize, 'args': { 'size': [0.5, 3] } }
      - EastRandomCropData:
          size: [640, 640]
          max_tries: 50
          keep_ratio: true
      - MakeBorderMap:
          shrink_ratio: 0.4
          thresh_min: 0.3
          thresh_max: 0.7
      - MakeShrinkMap:
          shrink_ratio: 0.4
          min_text_size: 8
      - NormalizeImage:
          scale: 1./255.
          mean: [0.485, 0.456, 0.406]
          std: [0.229, 0.224, 0.225]
          order: 'hwc'
      - ToCHWImage:
      - KeepKeys:
          keep_keys: ['image', 'threshold_map', 'threshold_mask', 'shrink_map', 'shrink_mask'] # the order of the dataloader list
  loader:
    shuffle: True
    drop_last: False
    batch_size_per_card: 16
    num_workers: 8
    use_shared_memory: True

Eval:
  dataset:
    name: SimpleDataSet
    data_dir: /data/huajie/ocr_data/scanpen_det/
    label_file_list:
      - /data/huajie/ocr_data/dbnet_230720/det_label_eval.txt
    transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - DetLabelEncode: # Class handling label
      - DetResizeForTest:
          image_shape: [736, 1280]
      - NormalizeImage:
          scale: 1./255.
          mean: [0.485, 0.456, 0.406]
          std: [0.229, 0.224, 0.225]
          order: 'hwc'
      - ToCHWImage:
      - KeepKeys:
          keep_keys: ['image', 'shape', 'polys', 'ignore_tags']
  loader:
    shuffle: False
    drop_last: False
    batch_size_per_card: 1 # must be 1
    num_workers: 8
    use_shared_memory: True

Log:
[2023/09/12 08:59:36] ppocr INFO: epoch: [1/200], global_step: 260, lr: 0.001000, loss: 9.192334, loss_shrink_maps: 4.588675, loss_threshold_maps: 3.721466, loss_binary_maps: 0.917968, loss_cbn: 0.000000, avg_reader_cost: 0.00402 s, avg_batch_cost: 0.93513 s, avg_samples: 16.0, ips: 17.10989 samples/s, eta: 1 day, 1:10:51
Found inf or nan, current scale is: 7.52316384526264e-37, decrease to: 7.52316384526264e-37*0.5
Found inf or nan, current scale is: 3.76158192263132e-37, decrease to: 3.76158192263132e-37*0.5
Found inf or nan, current scale is: 1.88079096131566e-37, decrease to: 1.88079096131566e-37*0.5
Found inf or nan, current scale is: 9.4039548065783e-38, decrease to: 9.4039548065783e-38*0.5
Found inf or nan, current scale is: 4.70197740328915e-38, decrease to: 4.70197740328915e-38*0.5
[2023/09/12 08:59:44] ppocr INFO: epoch: [1/200], global_step: 270, lr: 0.001000, loss: 9.182466, loss_shrink_maps: 4.572325, loss_threshold_maps: 3.721908, loss_binary_maps: 0.915590, loss_cbn: 0.000000, avg_reader_cost: 0.00384 s, avg_batch_cost: 0.81640 s, avg_samples: 16.0, ips: 19.59831 samples/s, eta: 1 day, 1:05:13
Found inf or nan, current scale is: 2.350988701644575e-38, decrease to: 2.350988701644575e-38*0.5
Found inf or nan, current scale is: 1.1754943508222875e-38, decrease to: 1.1754943508222875e-38*0.5
Found inf or nan, current scale is: 5.877471754111438e-39, decrease to: 5.877471754111438e-39*0.5
Found inf or nan, current scale is: 2.938735877055719e-39, decrease to: 2.938735877055719e-39*0.5
Found inf or nan, current scale is: 1.4693679385278594e-39, decrease to: 1.4693679385278594e-39*0.5
[2023/09/12 08:59:51] ppocr INFO: epoch: [1/200], global_step: 280, lr: 0.001000, loss: 9.154133, loss_shrink_maps: 4.556365, loss_threshold_maps: 3.687920, loss_binary_maps: 0.912402, loss_cbn: 0.000000, avg_reader_cost: 0.00361 s, avg_batch_cost: 0.61952 s, avg_samples: 16.0, ips: 25.82635 samples/s, eta: 1 day, 0:48:14
Found inf or nan, current scale is: 7.346839692639297e-40, decrease to: 7.346839692639297e-40*0.5
Found inf or nan, current scale is: 3.6734198463196485e-40, decrease to: 3.6734198463196485e-40*0.5
Found inf or nan, current scale is: 1.8367099231598242e-40, decrease to: 1.8367099231598242e-40*0.5
Found inf or nan, current scale is: 9.183549615799121e-41, decrease to: 9.183549615799121e-41*0.5
Found inf or nan, current scale is: 4.591774807899561e-41, decrease to: 4.591774807899561e-41*0.5
[2023/09/12 08:59:59] ppocr INFO: epoch: [1/200], global_step: 290, lr: 0.001000, loss: 9.191483, loss_shrink_maps: 4.556365, loss_threshold_maps: 3.690576, loss_binary_maps: 0.913825, loss_cbn: 0.000000, avg_reader_cost: 0.00864 s, avg_batch_cost: 0.83550 s, avg_samples: 16.0, ips: 19.15032 samples/s, eta: 1 day, 0:44:50
Found inf or nan, current scale is: 2.2958874039497803e-41, decrease to: 2.2958874039497803e-41*0.5
Found inf or nan, current scale is: 1.1479437019748901e-41, decrease to: 1.1479437019748901e-41*0.5
Found inf or nan, current scale is: 5.739718509874451e-42, decrease to: 5.739718509874451e-42*0.5
Found inf or nan, current scale is: 2.8698592549372254e-42, decrease to: 2.8698592549372254e-42*0.5
Found inf or nan, current scale is: 1.4349296274686127e-42, decrease to: 1.4349296274686127e-42*0.5
[2023/09/12 09:00:04] ppocr INFO: epoch: [1/200], global_step: 300, lr: 0.001000, loss: 9.258297, loss_shrink_maps: 4.605122, loss_threshold_maps: 3.727853, loss_binary_maps: 0.921244, loss_cbn: 0.000000, avg_reader_cost: 0.00138 s, avg_batch_cost: 0.49900 s, avg_samples: 16.0, ips: 32.06392 samples/s, eta: 1 day, 0:22:57
Found inf or nan, current scale is: 7.174648137343064e-43, decrease to: 7.174648137343064e-43*0.5
Found inf or nan, current scale is: 3.587324068671532e-43, decrease to: 3.587324068671532e-43*0.5
Found inf or nan, current scale is: 1.793662034335766e-43, decrease to: 1.793662034335766e-43*0.5
Found inf or nan, current scale is: 8.96831017167883e-44, decrease to: 8.96831017167883e-44*0.5
Found inf or nan, current scale is: 4.484155085839415e-44, decrease to: 4.484155085839415e-44*0.5
[2023/09/12 09:00:15] ppocr INFO: epoch: [1/200], global_step: 310, lr: 0.001000, loss: 9.290211, loss_shrink_maps: 4.616122, loss_threshold_maps: 3.759188, loss_binary_maps: 0.924372, loss_cbn: 0.000000, avg_reader_cost: 0.00120 s, avg_batch_cost: 1.04335 s, avg_samples: 16.0, ips: 15.33520 samples/s, eta: 1 day, 0:31:46
Found inf or nan, current scale is: 2.2420775429197073e-44, decrease to: 2.2420775429197073e-44*0.5
Found inf or nan, current scale is: 1.1210387714598537e-44, decrease to: 1.1210387714598537e-44*0.5
Found inf or nan, current scale is: 5.605193857299268e-45, decrease to: 5.605193857299268e-45*0.5
Found inf or nan, current scale is: 2.802596928649634e-45, decrease to: 2.802596928649634e-45*0.5
Found inf or nan, current scale is: 1.401298464324817e-45, decrease to: 1.401298464324817e-45*0.5
[2023/09/12 09:00:22] ppocr INFO: epoch: [1/200], global_step: 320, lr: 0.001000, loss: 9.306165, loss_shrink_maps: 4.619713, loss_threshold_maps: 3.759188, loss_binary_maps: 0.924158, loss_cbn: 0.000000, avg_reader_cost: 0.00634 s, avg_batch_cost: 0.72490 s, avg_samples: 16.0, ips: 22.07212 samples/s, eta: 1 day, 0:23:25
Found inf or nan, current scale is: 0.0, decrease to: 0.0*0.5
Found inf or nan, current scale is: 0.0, decrease to: 0.0*0.5
Found inf or nan, current scale is: 0.0, decrease to: 0.0*0.5
Found inf or nan, current scale is: 0.0, decrease to: 0.0*0.5
Found inf or nan, current scale is: 0.0, decrease to: 0.0*0.5
[2023/09/12 09:00:28] ppocr INFO: epoch: [1/200], global_step: 330, lr: 0.001000, loss: 9.273979, loss_shrink_maps: 4.605062, loss_threshold_maps: 3.747232, loss_binary_maps: 0.922415, loss_cbn: 0.000000, avg_reader_cost: 0.00323 s, avg_batch_cost: 0.61246 s, avg_samples: 16.0, ips: 26.12411 samples/s, eta: 1 day, 0:09:52
Found inf or nan, current scale is: 0.0, decrease to: 0.0*0.5
Found inf or nan, current scale is: 0.0, decrease to: 0.0*0.5
Found inf or nan, current scale is: 0.0, decrease to: 0.0*0.5
Found inf or nan, current scale is: 0.0, decrease to: 0.0*0.5
Found inf or nan, current scale is: 0.0, decrease to: 0.0*0.5
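
For readers wondering why the scale ends at exactly 0.0: the values in the log are consistent with a float32 loss scale that is halved every time inf/nan is detected. The sketch below is only an illustration under that assumption (the helper names `to_float32` and `halvings_until_zero` are hypothetical and are not part of Paddle or PaddleOCR). It counts how many halvings a float32 value starting at `scale_loss: 1024.0` survives before underflowing to zero.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def halvings_until_zero(init_scale: float, decr_ratio: float = 0.5) -> int:
    """Count multiplications by decr_ratio until a float32 scale becomes 0.0.

    Assumption: the scale is stored in float32 and shrinks by decr_ratio
    on every decrease, as the halving pattern in the log suggests.
    """
    scale = to_float32(init_scale)
    count = 0
    while scale != 0.0:
        scale = to_float32(scale * decr_ratio)
        count += 1
    return count


# 1024 = 2**10 and the smallest positive float32 is 2**-149,
# so the 160th halving is the one that underflows to zero.
n = halvings_until_zero(1024.0)
print(n)  # 160

# The last non-zero scale in the log, 1.401298464324817e-45, is 2**-149.
print(to_float32(2.0 ** -149) == 2.0 ** -149)  # True
```

The log prints five decreases per ten steps, so if a decrease happens once every two overflowing steps, 160 halvings correspond to roughly 320 steps, which is about where the log first reports a scale of 0.0. In other words, under this reading every step since the start of training overflowed, and dynamic loss scaling could not find a usable scale.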

Answered by andyjiang1116


andyjiang1116 (Collaborator) · 2024-07-24T02:22:15Z

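Cause: AMP training fails to converge because the conv layers overflow under AMP. In the model's DBFPN neck https://github.com/PaddlePaddle/PaddleOCR/blob/main/ppocr/modeling/necks/db_fpn.py#L123-L181 the conv2d outputs are not followed by a BN layer for normalization, so the model does not converge.
Solution: replace the conv2d layers with ConvBNLayer https://github.com/PaddlePaddle/PaddleOCR/blob/main/ppocr/modeling/backbones/det_mobilenet_v3.py#L158-L200. We verified that the model then converges, reaching best metric, hmean: 0.7441052370315215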


tink2123 Jul 24, 2024
Maintainer

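Thanks for the feedback; we can confirm the issue. We will try to fix the conv overflow. In the meantime, the structural change described above can be used as a workaround.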
