Learning Rate Scheduling of Transformer
The Transformer model has its own way of doing learning rate scheduling. It usually has a warmup phase and a decay phase: during the warmup phase the learning rate increases linearly, and afterwards it decays following an inverse-square-root or exponential curve. There are many configurations of this kind of scheduling, and many of them are included in tensor2tensor (T2T). In order to follow the latest progress, we try to keep our code compatible with T2T on this part. Some parts are still work in progress.
T2T now has two kinds of scheduling configuration: a legacy configuration, which is the same as the method in Attention is All You Need, and a factored configuration, which combines different kinds of scheduling over different timesteps. This code currently supports only the former, and we are working on implementing the latter.
The overall formula of the legacy scheduling, also known as Noam, is
$$
\text{lrate} = \text{ret} \times \text{opt\_corr} \times \text{init\_lr}
$$
where, following T2T's legacy schedule,
$$
\text{ret} = 5000 \cdot d_{\text{model}}^{-0.5} \cdot \min\left((\text{step} + 1) \cdot \text{warmup\_steps}^{-1.5},\ (\text{step} + 1)^{-0.5}\right)
$$
As we use Adam to train the Transformer, $\text{opt\_corr} = 0.002$ (T2T applies this correction factor when the optimizer is Adam, and 1.0 otherwise), and $\text{init\_lr}$ is the configured base learning rate.
T2T has two version settings, base_v1 and base_v2. In base_v1, $\text{init\_lr} = 0.1$, $\text{warmup\_steps} = 4000$ and $d_{\text{model}} = 512$, which is the setting used in the example below.
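To make the schedule concrete, here is a minimal Python sketch of the legacy/Noam formula above; the function name noam_lrate and its arguments are illustrative, not the actual API of this repository or of T2T.

```python
def noam_lrate(step, d_model=512, warmup_steps=4000,
               init_lr=0.1, opt_corr=0.002):
    """Legacy/Noam learning rate at a given training step (a sketch).

    Increases roughly linearly during warmup, then decays as (step + 1) ** -0.5.
    opt_corr is the Adam correction factor described above.
    """
    step = step + 1  # the schedule counts steps from 1
    ret = 5000.0 * d_model ** -0.5 * min(step * warmup_steps ** -1.5,
                                         step ** -0.5)
    return ret * opt_corr * init_lr


# Example: the schedule peaks around the end of warmup.
for s in (0, 1000, 4000, 100000):
    print(s, noam_lrate(s))
```

With the base_v1 values this peaks at roughly 7e-4 near step 4000 and then decays with the inverse square root of the step.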
How to configure noam in our code
Taking base_v1 as an example, we can configure it in the YAML file as below:
optimizer_configs:
  optimizer: "adam"
  learning_rate: 0.1
  schedule_method: noam
  scheduler_configs:
    d_model: 512
    warmup_steps: 4000
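As a rough illustration of how these YAML fields map onto the Noam formula, the sketch below parses the configuration with PyYAML and evaluates the schedule at a few steps; it is a standalone example under assumed key names taken from the snippet above, not this repository's actual loading or training code.

```python
import yaml

config_text = """
optimizer_configs:
  optimizer: "adam"
  learning_rate: 0.1
  schedule_method: noam
  scheduler_configs:
    d_model: 512
    warmup_steps: 4000
"""

opt_cfg = yaml.safe_load(config_text)["optimizer_configs"]
sched_cfg = opt_cfg["scheduler_configs"]
d_model = sched_cfg["d_model"]
warmup = sched_cfg["warmup_steps"]
init_lr = opt_cfg["learning_rate"]

# Evaluate the Noam schedule (same formula as the sketch above) at a few steps.
for step in (100, 4000, 20000):
    s = step + 1
    ret = 5000.0 * d_model ** -0.5 * min(s * warmup ** -1.5, s ** -0.5)
    print(step, ret * 0.002 * init_lr)  # 0.002 is the Adam correction factor
```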