
RL results are worse than Mle results in Rouge-1,2 #26

Open
pengzhi123 opened this issue Apr 14, 2020 · 3 comments

Comments

@pengzhi123

pengzhi123 commented Apr 14, 2020

Hi, we picked the best MLE checkpoint (0050000.tar) as the starting point for RL training. Although the ROUGE-L score improved, the ROUGE-1 and ROUGE-2 scores became much worse.
The evaluation is shown below (we use ROUGE-1 to select checkpoints).
mle (official testset):
Training mle: yes, Training rl: no, mle weight: 1.00, rl weight: 0.00
intra_encoder: True intra_decoder: True
0005000.tar rouge_1: 0.3174
0010000.tar rouge_1: 0.3249
0015000.tar rouge_1: 0.3289
0020000.tar rouge_1: 0.3325
0025000.tar rouge_1: 0.3331
0030000.tar rouge_1: 0.3357
0035000.tar rouge_1: 0.3379
0040000.tar rouge_1: 0.3355
0045000.tar rouge_1: 0.3382
0050000.tar rouge_1: 0.3426
0055000.tar rouge_1: 0.3384
0060000.tar rouge_1: 0.3339
0065000.tar rouge_1: 0.3410
0070000.tar rouge_1: 0.3408
0075000.tar rouge_1: 0.3425
0080000.tar rouge_1: 0.3384
0085000.tar rouge_1: 0.3362
0090000.tar rouge_1: 0.3424
0095000.tar rouge_1: 0.3377
0100000.tar rouge_1: 0.3361
0105000.tar rouge_1: 0.3357
0110000.tar rouge_1: 0.3389
0115000.tar rouge_1: 0.3374
0120000.tar rouge_1: 0.3341
0125000.tar rouge_1: 0.3357
0130000.tar rouge_1: 0.3377
0135000.tar rouge_1: 0.3317
0140000.tar rouge_1: 0.3321
0145000.tar rouge_1: 0.3349
0150000.tar rouge_1: 0.3363
rl (official testset):
in_rl=yes --mle_weight=0.0 --load_model=0050000.tar --new_lr=0.0001
Training mle: no, Training rl: yes, mle weight: 0.00, rl weight: 1.00
intra_encoder: True intra_decoder: True
Loaded model at data/saved_models/0050000.tar
0050000.tar rouge_1: 0.3426
0055000.tar rouge_1: 0.2522
0060000.tar rouge_1: 0.2520
0065000.tar rouge_1: 0.2549
0070000.tar rouge_1: 0.2550
0075000.tar rouge_1: 0.2547
0080000.tar rouge_1: 0.2584
0085000.tar rouge_1: 0.2576
0090000.tar rouge_1: 0.2543
0095000.tar rouge_1: 0.2567
0100000.tar rouge_1: 0.2562
0105000.tar rouge_1: 0.2556
0110000.tar rouge_1: 0.2547
0115000.tar rouge_1: 0.2575
0120000.tar rouge_1: 0.2543
0125000.tar rouge_1: 0.2581
0130000.tar rouge_1: 0.2534
0135000.tar rouge_1: 0.2533
0140000.tar rouge_1: 0.2526
0145000.tar rouge_1: 0.2511
0150000.tar rouge_1: 0.2547

mle result:
0075000.tar scores: {'rouge-1': {'f': 0.3424728366572667, 'p': 0.39166721241721236, 'r': 0.31968494072078807}, 'rouge-2': {'f': 0.1732520206640223, 'p': 0.19845553983053968, 'r': 0.1623725413112666}, 'rouge-l': {'f': 0.32962985739519235, 'p': 0.3758193750693756, 'r': 0.3075168451832533}}

rl result:
0080000.tar scores: {'rouge-1': {'f': 0.2574669041724543, 'p': 0.21302155489848726, 'r': 0.34803503077209935}, 'rouge-2': {'f': 0.11896310475645827, 'p': 0.09758671687502977, 'r': 0.16587082443700088}, 'rouge-l': {'f': 0.35379459020991105, 'p': 0.39799812070645335, 'r': 0.33855028225319733}}

0125000.tar scores: {'rouge-1': {'f': 0.25674349158898563, 'p': 0.21440196978373974, 'r': 0.34277860517537473}, 'rouge-2': {'f': 0.11907341598225046, 'p': 0.09900864566338015, 'r': 0.16306397570581008}, 'rouge-l': {'f': 0.35462601354567735, 'p': 0.40579230645897313, 'r': 0.33368591052575747}}
thanks for your help!
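For context: with --mle_weight=0.0 the fine-tuning above optimizes the RL reward alone, and pure-RL fine-tuning is known to trade n-gram overlap (ROUGE-1/2) for whatever the reward measures; Paulus et al. (2018) keep a mixed MLE+RL objective for exactly this reason. Below is a minimal sketch of such a mixed objective; the names (mle_loss, sample_log_probs, the reward arguments) are illustrative placeholders, not this repository's exact API.

```python
import torch

# Minimal sketch of a mixed MLE + self-critical RL objective (Paulus et al., 2018).
# All names here are illustrative, not this repository's API.
def mixed_loss(mle_loss, sample_log_probs, sample_reward, greedy_reward,
               mle_weight=0.25):
    """mle_loss: token-level cross-entropy (scalar tensor)
    sample_log_probs: summed log-probs of sampled summaries, shape [batch]
    sample_reward / greedy_reward: ROUGE of sampled vs. greedy output, shape [batch]
    """
    # Self-critical baseline: only samples that beat greedy decoding get reinforced.
    advantage = (sample_reward - greedy_reward).detach()
    rl_loss = -(advantage * sample_log_probs).mean()
    rl_weight = 1.0 - mle_weight
    # With mle_weight=0.0 only the RL term remains, which can hurt ROUGE-1/2
    # even while the reward (ROUGE-L here, judging by the scores) improves.
    return mle_weight * mle_loss + rl_weight * rl_loss
```

Keeping a nonzero mle_weight during the RL phase is the usual way to preserve the ROUGE-1/2 scores reached by MLE training.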

@pengzhi123 pengzhi123 changed the title RL results are worse than Mle results RL results are worse than Mle results in Rouge-1 Apr 14, 2020
@pengzhi123 pengzhi123 changed the title RL results are worse than Mle results in Rouge-1 RL results are worse than Mle results in Rouge-1,2 Apr 14, 2020
@nkathireshan

@pengzhi123 Could you please let me know the system specification you used? I am trying to run this on a Windows machine with 32 GB RAM, and I don't have CUDA enabled on my system.
I suspect this code won't run properly in a Windows environment; please advise.

@pengzhi123
Author


You should use Ubuntu, not Windows.
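If you still want to try it without CUDA, the standard PyTorch workaround is to select the device at runtime instead of calling .cuda() unconditionally. This is a generic sketch (the model and batch names are placeholders), and the repository's own code may need the corresponding edits:

```python
import torch
import torch.nn as nn

# Generic CPU-fallback pattern; if the repository hard-codes .cuda() calls,
# they would need to be replaced with .to(device) for this to help.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(8, 2).to(device)         # placeholder for the summarizer model
batch = torch.randn(4, 8, device=device)   # placeholder for an input batch
print(model(batch).shape, "on", device)
```

Expect CPU-only training to be far slower than the GPU runs reported above.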


@Berylv587

Hello, I got a StopIteration error when running eval.py. Have you encountered the same problem? If so, please tell me how to solve it.
(screenshot of the error attached)
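In case it helps while waiting for an answer: a StopIteration in eval.py usually means a data iterator ran out of batches (or a generator let StopIteration escape, which Python 3.7+ turns into a RuntimeError). The exact fix depends on where your traceback points, but a defensive pattern like the following is a common workaround; the function and argument names are placeholders, not this repository's API:

```python
# Generic guard around an exhausted iterator during evaluation.
# batch_iter and decode_one_batch are placeholder names.
def run_eval(batch_iter, decode_one_batch):
    results = []
    while True:
        try:
            batch = next(batch_iter)
        except StopIteration:
            break  # no more batches: stop cleanly instead of crashing
        results.append(decode_one_batch(batch))
    return results
```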
