Evaluating the E-Branchformer model #8

Open
fujimotos opened this issue Feb 6, 2023 · 5 comments

Comments

@fujimotos
Member

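Ticket goals

  • The current ReazonSpeech model is a Conformer-based speech recognition model.
  • Create an E-Branchformer-based recipe and train a model with it.
  • Evaluate the trained model and check for accuracy improvements and other gains.

Reference links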

@euyniy

euyniy commented Apr 7, 2023

A while ago I used the ESPnet2 LibriSpeech recipe to train for 31 epochs on reazonspeech medium (500h~). The log is below (it does not appear to have saturated yet):

2023-02-16 07:43:47,882 (trainer:338) INFO: 31epoch results: 
[train] iter_time=2.915e-04, forward_time=0.100, loss_ctc=40.290, loss_att=30.836, acc=0.690, loss=33.672, 
backward_time=0.117, optim_step_time=0.083, optim0_lr0=2.697e-04, train_time=28.888, time=1 hour, 25 minutes and 
30.11 seconds, total_count=550157, gpu_max_cached_mem_GB=4.861, 
[valid] loss_ctc=21.431, cer_ctc=0.259, loss_att=16.760, acc=0.834, cer=0.222, wer=0.849, loss=18.161, time=44.84 
seconds, total_count=3255, gpu_max_cached_mem_GB=4.861, 
[Figures: loss and CER curves]

For reference, the current conformer-transformer model (although its parameters differ) looks like this:

2023-02-11 03:22:58,191 (trainer:338) INFO: 31epoch results: 
[train] iter_time=2.541e-04, forward_time=0.077, loss_ctc=31.263, loss_att=17.444, acc=0.787, loss=21.590, 
backward_time=0.063, optim_step_time=0.057, optim0_lr0=7.346e-04, train_time=6.864, time=34 minutes and 31.61 
seconds, total_count=280519, gpu_max_cached_mem_GB=4.801, 
[valid] loss_ctc=22.093, cer_ctc=0.266, loss_att=12.771, acc=0.859, cer=0.194, wer=0.799, loss=15.567, time=12.59 
seconds, total_count=1674, gpu_max_cached_mem_GB=4.801, [att_plot] time=1 minute and 6.61 seconds, total_count=0, 
gpu_max_cached_mem_GB=4.801
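
The two epoch summaries above are easier to compare once the `key=value` pairs are pulled out per section. The following is a minimal sketch, not part of ESPnet: `parse_epoch_summary` is a hypothetical helper, the log strings are shortened excerpts of the two summaries in this thread, and the bare `time=` field is skipped because its value is free text ("1 hour, 25 minutes and ...").

```python
import re

# Shortened excerpts of the two "31epoch results" summaries quoted in this thread.
E_BRANCHFORMER_LOG = """
[train] loss_ctc=40.290, loss_att=30.836, acc=0.690, loss=33.672,
optim0_lr0=2.697e-04, train_time=28.888, time=1 hour, 25 minutes and
30.11 seconds, total_count=550157,
[valid] loss_ctc=21.431, cer_ctc=0.259, loss_att=16.760, acc=0.834, cer=0.222, wer=0.849, loss=18.161, time=44.84
seconds, total_count=3255,
"""

CONFORMER_LOG = """
[train] loss_ctc=31.263, loss_att=17.444, acc=0.787, loss=21.590,
optim0_lr0=7.346e-04, train_time=6.864, time=34 minutes and 31.61
seconds, total_count=280519,
[valid] loss_ctc=22.093, cer_ctc=0.266, loss_att=12.771, acc=0.859, cer=0.194, wer=0.799, loss=15.567, time=12.59
seconds, total_count=1674,
"""

SECTION_RE = re.compile(r"\[(\w+)\]")
METRIC_RE = re.compile(r"(\w+)=([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)")


def parse_epoch_summary(text):
    """Return {section: {metric: float}} for a wrapped epoch summary (hypothetical helper)."""
    flat = " ".join(text.split())  # undo the line wrapping of the log
    parts = SECTION_RE.split(flat)  # [preamble, name1, body1, name2, body2, ...]
    result = {}
    for name, body in zip(parts[1::2], parts[2::2]):
        # Skip the bare "time" field: its value is prose, not a single number.
        result[name] = {k: float(v) for k, v in METRIC_RE.findall(body) if k != "time"}
    return result


ebf = parse_epoch_summary(E_BRANCHFORMER_LOG)
conf = parse_epoch_summary(CONFORMER_LOG)

# Difference on the validation set (E-Branchformer minus Conformer).
delta = {k: round(ebf["valid"][k] - conf["valid"][k], 3) for k in ("cer", "cer_ctc", "acc", "loss")}
for key, value in delta.items():
    print(f"valid {key}: {ebf['valid'][key]} vs {conf['valid'][key]} (diff {value:+})")
```

At epoch 31 the E-Branchformer run is slightly better on `cer_ctc` but worse on attention-based `cer`, `acc`, and total `loss`. As noted in the thread, the two models differ in size and training settings, so this is not a like-for-like comparison.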

We have no plans to run this at large scale for now, but if there is any progress on the branchformer experiments, I will post it here.

@sw005320

sw005320 commented Apr 7, 2023

@pyf98, maybe you can help them.
You can translate this into English (or Chinese).

I think their learning rate is too low in this scenario, or there is something wrong with the actual batch size (with multiple GPUs or gradient accumulation).
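
The numbers in the two logs give a rough way to check this. The sketch below uses only values quoted in the logs earlier in the thread and assumes that `total_count` in the `[train]` section is the cumulative number of training iterations after 31 epochs. That is an assumption about the log format, not something confirmed in the thread.

```python
EPOCHS = 31

# Values copied from the [train] sections of the two 31-epoch summaries.
ebf_total_count, ebf_lr = 550157, 2.697e-04    # E-Branchformer run
conf_total_count, conf_lr = 280519, 7.346e-04  # Conformer run

# Assumption: total_count is cumulative, so dividing by the epoch count
# gives iterations per epoch.
ebf_iters_per_epoch = ebf_total_count / EPOCHS
conf_iters_per_epoch = conf_total_count / EPOCHS

iters_ratio = ebf_iters_per_epoch / conf_iters_per_epoch
lr_ratio = conf_lr / ebf_lr

print(f"iterations per epoch: {ebf_iters_per_epoch:.0f} vs {conf_iters_per_epoch:.0f}")
print(f"E-Branchformer ran {iters_ratio:.2f}x as many iterations per epoch")
print(f"Conformer learning rate at epoch 31 was {lr_ratio:.2f}x higher")
```

If both runs saw the same amount of data per epoch, roughly twice as many iterations per epoch would mean roughly half as much data per iteration in the E-Branchformer run. The logged learning rate at epoch 31 was also lower. Both observations are inferences from the logs and depend on settings (batch configuration, scheduler, gradient accumulation) that the thread does not show.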

@pyf98

pyf98 commented Apr 7, 2023

I'm not sure what Conformer and E-Branchformer configs are being used exactly. I feel some configs might have issues.

The Conformer config provided above has 12 layers without Macaron FFN. The input layer downsamples 6 times. These are different from the configs in other recipes (e.g., LibriSpeech). If you simply use the same E-Branchformer config from LibriSpeech, there can be some issues. For example, the model can be much larger.

In our experiments, we scale Conformer and E-Branchformer to have similar parameter counts. In such cases, we usually do not need to tune the training hyper-parameters again. We have added E-Branchformer configs and results in many other ESPnet2 recipes covering various types of speech.

@euyniy

euyniy commented Apr 11, 2023

@pyf98 @sw005320
Thanks for your input!
The experiment above was conducted with this config on 500h~ of data. The E-Branchformer model has 145M params and the Conformer used for comparison has 91M. (By the way, in our latest released Conformer model we enabled Macaron FFN.)

We will check the lr/accum_grad/multi-GPU/downsampling configurations and other recipes as well when we run more experiments on a larger dataset!

@pyf98

pyf98 commented Apr 12, 2023

Thanks for the information. When comparing these models (E-Branchformer vs Conformer), we typically just replaced the encoder config (at a similar model size) but kept the other training configs the same. This worked well in general.
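
The approach described above, replacing only the encoder and leaving every other training setting alone, can be sketched with plain dictionaries. This is an illustration under assumptions: `swap_encoder` is a hypothetical helper, the key names follow the usual layout of ESPnet2 YAML configs (`encoder`, `encoder_conf`, `optim_conf`, `accum_grad`), and all values are placeholders rather than the settings used in this thread.

```python
import copy


def swap_encoder(base_config, encoder, encoder_conf):
    """Return a copy of base_config with only the encoder settings replaced (hypothetical helper)."""
    new_config = copy.deepcopy(base_config)
    new_config["encoder"] = encoder
    new_config["encoder_conf"] = copy.deepcopy(encoder_conf)
    return new_config


# Placeholder baseline; the values are illustrative, not taken from the thread.
conformer_config = {
    "encoder": "conformer",
    "encoder_conf": {"output_size": 256, "num_blocks": 12},
    "optim": "adam",
    "optim_conf": {"lr": 0.002},
    "accum_grad": 4,
}

# Placeholder E-Branchformer encoder settings. In practice these would be
# chosen so that the parameter count is close to the baseline.
e_branchformer_conf = {"output_size": 256, "num_blocks": 12}

ebf_config = swap_encoder(conformer_config, "e_branchformer", e_branchformer_conf)

# Everything except the encoder keys is unchanged.
unchanged = [k for k in conformer_config if k not in ("encoder", "encoder_conf")]
print({k: ebf_config[k] == conformer_config[k] for k in unchanged})
```

Keeping the optimizer, scheduler, and batch settings fixed means that any difference in results can be attributed to the encoder, provided the two models are of similar size.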

4 participants