how to generate balanced batches for multilingual translation? #890
Comments
One easy way is to keep the current way of loading the dataset and re-weight the loss of the batches from each language pair according to its ratio in the original data.
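A minimal sketch of one way to read that suggestion, in plain PyTorch (framework-agnostic, not fairseq's own API): if batches keep arriving in proportion to corpus size, you can scale each pair's loss so that every pair contributes equally on average. The corpus sizes, the `lang_pair` bookkeeping, and the uniform target share below are illustrative assumptions.

```python
import torch

# Hypothetical corpus sizes per language pair (illustrative numbers only).
corpus_sizes = {"en-de": 4_500_000, "en-fr": 36_000_000, "en-ro": 600_000}
total = sum(corpus_sizes.values())

# Scale each pair's loss by (desired share / observed share). With a uniform
# desired share of 1 / number_of_pairs, low-resource pairs are up-weighted and
# high-resource pairs are down-weighted.
observed_share = {pair: n / total for pair, n in corpus_sizes.items()}
loss_weights = {pair: (1.0 / len(corpus_sizes)) / share
                for pair, share in observed_share.items()}

def weighted_loss(per_token_loss: torch.Tensor, lang_pair: str) -> torch.Tensor:
    """Scale the loss of a batch from one language pair before calling backward()."""
    return loss_weights[lang_pair] * per_token_loss.mean()
```

One caveat with pure loss re-weighting: it balances the average gradient contribution, but low-resource pairs are still seen in fewer distinct batches per epoch than true over-sampling would give.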
facebook-github-bot pushed a commit that referenced this issue on Oct 27, 2019:
Summary:

TEST 1: EVALUATION TIME WORKS
checked, achieves correct model perplexity: 18.68

TEST 2: TRAINING NEW MODEL WORKS
checked
without layerdrop: --decoder-layerdrop 0 OR no flag at all
| epoch 001: 10 / 11201 loss=27.469, nll_loss=27.469, ppl=185799477.36, wps=1764, ups=0, wpb=9216.000, bsz=3.000, num_updates=7, lr=0.0004376, gnorm=25.471, clip=1.000, oom=0.000, loss_scale=8.000, wall=37, train_wall=30
| epoch 001: 20 / 11201 loss=27.443, nll_loss=27.443, ppl=182500427.22, wps=2449, ups=0, wpb=9216.000, bsz=3.000, num_updates=17, lr=0.0010626, gnorm=25.273, clip=1.000, oom=0.000, loss_scale=8.000, wall=64, train_wall=57
| epoch 001: 30 / 11201 loss=27.404, nll_loss=27.404, ppl=177612215.78, wps=2720, ups=0, wpb=9216.000, bsz=3.000, num_updates=27, lr=0.0016876, gnorm=25.136, clip=1.000, oom=0.000, loss_scale=8.000, wall=91, train_wall=84
| epoch 001: 40 / 11201 loss=27.009, nll_loss=27.009, ppl=135079983.00, wps=2865, ups=0, wpb=9216.000, bsz=3.000, num_updates=37, lr=0.0023126, gnorm=24.311, clip=1.000, oom=0.000, loss_scale=8.000, wall=119, train_wall=112
| epoch 001: 50 / 11201 loss=26.418, nll_loss=26.418, ppl=89680259.41, wps=2952, ups=0, wpb=9216.000, bsz=3.000, num_updates=47, lr=0.0029376, gnorm=22.775, clip=1.000, oom=0.000, loss_scale=8.000, wall=147, train_wall=140
with layerdrop (regularization effect should be seen in PPL): --decoder-layerdrop 0.2
| epoch 001: 10 / 11201 loss=25.186, nll_loss=25.186, ppl=38182937.27, wps=2428, ups=0, wpb=9216.000, bsz=3.000, num_updates=8, lr=0.0005001, gnorm=17.082, clip=1.000, oom=0.000, loss_scale=16.000, wall=30, train_wall=24
| epoch 001: 20 / 11201 loss=25.270, nll_loss=25.270, ppl=40451933.50, wps=3173, ups=0, wpb=9216.000, bsz=3.000, num_updates=18, lr=0.0011251, gnorm=17.162, clip=1.000, oom=0.000, loss_scale=16.000, wall=52, train_wall=45
| epoch 001: 30 / 11201 loss=25.349, nll_loss=25.349, ppl=42752256.68, wps=3454, ups=0, wpb=9216.000, bsz=3.000, num_updates=28, lr=0.0017501, gnorm=17.370, clip=1.000, oom=0.000, loss_scale=16.000, wall=75, train_wall=68
| epoch 001: 40 / 11201 loss=25.115, nll_loss=25.115, ppl=36343806.30, wps=3619, ups=0, wpb=9216.000, bsz=3.000, num_updates=38, lr=0.0023751, gnorm=16.945, clip=1.000, oom=0.000, loss_scale=16.000, wall=97, train_wall=90
| epoch 001: 50 / 11201 loss=24.804, nll_loss=24.804, ppl=29284345.78, wps=3716, ups=0, wpb=9216.000, bsz=3.000, num_updates=48, lr=0.0030001, gnorm=16.406, clip=1.000, oom=0.000, loss_scale=16.000, wall=119, train_wall=112

TEST 3: PICKING UP TRAINING FROM EXISTING MODEL
checked
| loaded checkpoint /checkpoint/angelafan/structured_0.1_block_8_sd02/checkpoint_last.pt (epoch 272 @ 381066 updates)
| loading train data for epoch 272
| loaded 1801350 examples from: /private/home/angelafan/lm_work/fairseq-py/data-bin/wikitext-103/train

TEST 4: EVALUATING EXISTING BERT MODEL REPROS RESULTS
| [input] dictionary: 50265 types
| [label] dictionary: 9 types
| Accuracy: 0.9231651376146789
achieves correct accuracy on SST2 for this model

TEST 5: TRAINING NEW BERT MODEL WORKS
checked and works

TEST 6: NMT
without layerdrop: --encoder-layerdrop 0 --decoder-layerdrop 0 OR combinations of flag specified and not specified
| epoch 001: 10 / 92203 loss=15.820, nll_loss=15.830, ppl=58267.93, wps=4902, ups=0, wpb=1477.818, bsz=51.636, num_updates=11, lr=1.47473e-06, gnorm=7.207, clip=0.000, oom=0.000, loss_scale=128.000, wall=60, train_wall=3
| epoch 001: 20 / 92203 loss=15.523, nll_loss=15.501, ppl=46359.29, wps=5037, ups=0, wpb=1496.476, bsz=45.333, num_updates=21, lr=2.72448e-06, gnorm=6.869, clip=0.000, oom=0.000, loss_scale=128.000, wall=63, train_wall=6
| epoch 001: 30 / 92203 loss=15.185, nll_loss=15.123, ppl=35695.79, wps=5085, ups=0, wpb=1519.355, bsz=44.645, num_updates=31, lr=3.97423e-06, gnorm=6.186, clip=0.000, oom=0.000, loss_scale=128.000, wall=66, train_wall=9
| epoch 001: 40 / 92203 loss=14.940, nll_loss=14.849, ppl=29505.60, wps=5116, ups=1, wpb=1521.244, bsz=42.927, num_updates=41, lr=5.22398e-06, gnorm=5.610, clip=0.000, oom=0.000, loss_scale=128.000, wall=69, train_wall=12
| epoch 001: 50 / 92203 loss=14.745, nll_loss=14.630, ppl=25346.87, wps=5070, ups=1, wpb=1507.961, bsz=41.725, num_updates=51, lr=6.47373e-06, gnorm=5.104, clip=0.000, oom=0.000, loss_scale=128.000, wall=71, train_wall=15
with layerdrop (regularization effect should be seen in PPL):
A) works with --encoder-layerdrop 0.2 --decoder-layerdrop 0.2
B) works with different settings --encoder-layerdrop 0.3 --decoder-layerdrop 0.5
C) works with one on and one off --encoder-layerdrop 0.2 --decoder-layerdrop 0
| epoch 001: 10 / 92203 loss=15.817, nll_loss=15.828, ppl=58158.54, wps=5355, ups=0, wpb=1477.818, bsz=51.636, num_updates=11, lr=1.47473e-06, gnorm=6.959, clip=0.000, oom=0.000, loss_scale=128.000, wall=59, train_wall=3
| epoch 001: 20 / 92203 loss=15.650, nll_loss=15.641, ppl=51111.63, wps=5515, ups=0, wpb=1496.476, bsz=45.333, num_updates=21, lr=2.72448e-06, gnorm=6.825, clip=0.000, oom=0.000, loss_scale=128.000, wall=61, train_wall=6
| epoch 001: 30 / 92203 loss=15.440, nll_loss=15.408, ppl=43491.58, wps=5602, ups=0, wpb=1519.355, bsz=44.645, num_updates=31, lr=3.97423e-06, gnorm=6.576, clip=0.000, oom=0.000, loss_scale=128.000, wall=64, train_wall=8
| epoch 001: 40 / 92203 loss=15.247, nll_loss=15.193, ppl=37457.14, wps=5676, ups=1, wpb=1521.244, bsz=42.927, num_updates=41, lr=5.22398e-06, gnorm=6.124, clip=0.000, oom=0.000, loss_scale=128.000, wall=67, train_wall=11
| epoch 001: 50 / 92203 loss=15.055, nll_loss=14.977, ppl=32259.92, wps=5598, ups=1, wpb=1507.961, bsz=41.725, num_updates=51, lr=6.47373e-06, gnorm=5.661, clip=0.000, oom=0.000, loss_scale=128.000, wall=69, train_wall=14

TEST 7: PRUNING TESTCASES
A) after adding the pruning flags, model can evaluate as a full model: checked, reaches correct PPL
num. model params: 246933504 | Evaluated 217646 tokens in 196.3s (1108.99 tokens/s) | Loss: 2.9275, Perplexity: 18.68
B) after adding pruning flags, model can be pruned; this works with multiple flag settings: checked three cases:
num. model params: 146163712 | Evaluated 217646 tokens in 106.0s (2054.07 tokens/s) | Loss: 3.0932, Perplexity: 22.05
num. model params: 209144832 | Evaluated 217646 tokens in 162.8s (1336.99 tokens/s) | Loss: 2.9526, Perplexity: 19.16
C) model can pick up training if you want to finetune the pruned model: checked
| loading train data for epoch 272
| loaded 1801350 examples from: /private/home/angelafan/lm_work/fairseq-py/data-bin/wikitext-103/train
| WARNING: overflow detected, setting loss scale to: 64.0
| WARNING: overflow detected, setting loss scale to: 32.0
| epoch 272: 1500 / 5601 loss=5.015, nll_loss=5.015, ppl=32.33, wps=11598, ups=1, wpb=18432.000, bsz=6.000, num_updates=98, lr=0.0061251, gnorm=0.613, clip=1.000, oom=0.000, loss_scale=32.000, wall=156, train_wall=252396
D) works with BERT: checked; without specifying any flags, reproduces the correct standard accuracy; with flags, produces the correct pruned accuracy
| [input] dictionary: 50265 types
| [label] dictionary: 9 types
| Accuracy: 0.9231651376146789
| [input] dictionary: 50265 types
| [label] dictionary: 9 types
| Pruning model to specified layer configuration - this works best if the model was trained with LayerDrop
| Accuracy: 0.9220183486238532

Pull Request resolved: fairinternal/fairseq-py#890
Reviewed By: edunov
Differential Revision: D18094657
Pulled By: huihuifan
fbshipit-source-id: 2bbaa2ff0039e906782694fc2038b8c17a8693e7
ebetica pushed a commit to ebetica/fairseq that referenced this issue on Nov 20, 2019 (same LayerDrop commit as above).
Closing due to inactivity.
moussaKam pushed a commit to moussaKam/language-adaptive-pretraining that referenced this issue on Sep 29, 2020 (same LayerDrop commit as above).
yfyeung pushed a commit to yfyeung/fairseq that referenced this issue on Dec 6, 2023.
I want to use the basic Transformer architecture for multilingual translation: instead of having separate encoders and decoders for each language pair, I want a single encoder and decoder shared across all language pairs. The batches should therefore be balanced across language pairs according to corpus size, to prevent overfitting to any one language pair. But I find this hard to implement, since the base Transformer setup only reads in one dataset and generates batches according to sequence length. Even if I build a dataset that contains all language pairs, the batch generator still cannot guarantee balanced batches. Is there a way to do this? Thanks.
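For anyone landing here, a minimal, framework-agnostic sketch of the sampling-side alternative to the loss re-weighting suggested above: keep one dataset per language pair and, for each new batch, draw the pair from a temperature-smoothed distribution over corpus sizes (temperature 1 reproduces the original proportions; larger values move toward uniform, i.e. balanced, batches). The dataset layout, helper names, and default temperature are illustrative assumptions, not fairseq code.

```python
import random
from typing import Dict, Iterator, List, Tuple

def pair_probs(corpus_sizes: Dict[str, int], temperature: float = 5.0) -> Dict[str, float]:
    """Temperature-smoothed sampling probabilities: p_i proportional to (n_i / N) ** (1 / T)."""
    total = sum(corpus_sizes.values())
    scores = {pair: (n / total) ** (1.0 / temperature) for pair, n in corpus_sizes.items()}
    z = sum(scores.values())
    return {pair: s / z for pair, s in scores.items()}

def balanced_batches(datasets: Dict[str, List], corpus_sizes: Dict[str, int],
                     batch_size: int, temperature: float = 5.0) -> Iterator[Tuple[str, List]]:
    """Endlessly yield (lang_pair, batch); the pair is re-drawn for every batch,
    so low-resource pairs are visited far more often than their raw data share."""
    probs = pair_probs(corpus_sizes, temperature)
    pairs = list(probs)
    weights = [probs[p] for p in pairs]
    while True:
        pair = random.choices(pairs, weights=weights, k=1)[0]
        # Real batching would also bucket by sequence length; plain random
        # sampling keeps the sketch short.
        batch = random.sample(datasets[pair], k=min(batch_size, len(datasets[pair])))
        yield pair, batch
```

The temperature is the main knob: a large value gives nearly uniform (fully balanced) batches over language pairs, while temperature 1 falls back to sampling in proportion to corpus size, and either setting can be combined with the loss re-weighting sketched earlier in the thread.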