support pp accuracy calculation #9379
Conversation
Thanks for your contribution!
paddlenlp/trainer/trainer.py
Outdated
if pp_group.nranks > 1:
    logit_shape = [[]]
    if "pp_logits" in infohub:
        logits = paddle.concat(infohub["pp_logits"], axis=0)
Why is a concat used here? I don't quite understand.
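To illustrate the question being discussed: in pipeline parallelism the eval batch is split into micro-batches, and the last stage produces one logits tensor per micro-batch, so stitching them back into full-batch logits is a concat along the batch axis. The sketch below uses NumPy stand-ins and an illustrative helper name, not PaddleNLP's actual API:

```python
import numpy as np

def gather_micro_batch_logits(micro_logits):
    # Each element is the logits for one micro-batch; concatenating along
    # axis 0 (the batch axis) restores the logits for the full eval batch.
    return np.concatenate(micro_logits, axis=0)

# 4 micro-batches of 2 samples each, seq_len 5, vocab size 8 (toy numbers)
micro_logits = [np.random.rand(2, 5, 8) for _ in range(4)]
full = gather_micro_batch_logits(micro_logits)
print(full.shape)  # (8, 5, 8): the full batch dimension is restored
```

A sum or mean would collapse the batch dimension and lose the per-sample predictions needed for accuracy, which is why concat is the right reduction here.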
…nto support-pp-acc
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##           develop    #9379      +/-   ##
===========================================
+ Coverage    52.93%   53.10%   +0.17%
===========================================
  Files          688      694       +6
  Lines       109379   110966    +1587
===========================================
+ Hits         57899    58930    +1031
- Misses      51480    52036     +556

☔ View full report in Codecov by Sentry.
paddlenlp/trainer/trainer.py
Outdated
# evaluation doesn't support drop last,
# so set `accumulate_steps` to the actual
# eval batch size.
model_config_backup = model.accumulate_steps
Isn't this naming a bit off? This clearly isn't a model config.
paddlenlp/trainer/trainer.py
Outdated
logits = None
if "pp_logits" in infohub:
    logits = paddle.concat(infohub["pp_logits"], axis=0)
    logits = logits._copy_to(paddle.framework._current_expected_place(), False)
Is the copy here because pp_logits is kept in CPU memory or CUDA pinned memory?
Yes. If the logits were not kept in CPU or pinned memory here, the concat would create a peak GPU-memory spike of twice the logits size and cause an OOM.
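The peak-memory argument above can be sketched with back-of-envelope arithmetic (the function and numbers are illustrative, not Paddle's accounting): concat allocates an output buffer equal to the total size of its inputs, and the inputs are still alive while the copy runs, so concatenating on the GPU holds roughly 2x the full logits size at peak. Accumulating micro-batch logits on CPU/pinned memory and copying only the result back keeps the GPU peak near 1x:

```python
def peak_device_bytes(num_micro, per_micro_bytes, concat_on_device):
    total = num_micro * per_micro_bytes  # all micro-batch logits together
    # If the concat happens on the device, its inputs and its freshly
    # allocated output buffer coexist at peak; otherwise only the final
    # copied-back result occupies device memory.
    return 2 * total if concat_on_device else total

# Example sizing: batch 32, seq 4096, vocab 152064, fp16 (2 bytes)
per_micro = 32 * 4096 * 152064 * 2
ratio = peak_device_bytes(8, per_micro, True) / peak_device_bytes(8, per_micro, False)
print(ratio)  # 2.0: an on-device concat doubles the peak footprint
```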
paddlenlp/trainer/trainer.py
Outdated
@@ -3312,6 +3347,8 @@ def prediction_step(
if self.args.pipeline_parallel_degree > 1:
    # hack for pipeline mode
    inputs = self._prepare_inputs(inputs)
    if self.args.metric_for_best_model == "accuracy":
I suggest not putting this in Trainer; putting it in SFTTrainer would be more appropriate.
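The reviewer's suggestion amounts to keeping the generic trainer untouched and overriding the evaluation hook in the SFT-specific subclass. A minimal sketch of that shape — class and method names mirror the PR's discussion, but the bodies are purely illustrative, not PaddleNLP's implementation:

```python
class Trainer:
    """Generic trainer: knows nothing about accuracy-specific eval."""

    def prediction_step(self, model, inputs):
        # generic evaluation path
        return {"loss": 0.0}

class SFTTrainer(Trainer):
    """SFT-specific trainer: the accuracy handling lives only here."""

    metric_for_best_model = None

    def prediction_step(self, model, inputs):
        out = super().prediction_step(model, inputs)
        if self.metric_for_best_model == "accuracy":
            # hypothetical marker standing in for collecting pp logits
            out["logits"] = "collected-for-accuracy"
        return out

t = SFTTrainer()
t.metric_for_best_model = "accuracy"
print(t.prediction_step(None, None))  # {'loss': 0.0, 'logits': 'collected-for-accuracy'}
```

Keeping the branch in the subclass avoids leaking a task-specific metric assumption into the shared Trainer code path.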
paddlenlp/trainer/trainer.py
Outdated
# evaluation doesn't support drop last,
# so set `accumulate_steps` to the actual
# eval batch size.
model_config_backup = model.accumulate_steps
paddlenlp/trainer/trainer.py
Outdated
else:
    input_ids = inputs

model.accumulate_steps = input_ids.shape[0]
Or alternatively, set model.micro_batch_size directly to 1.
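The constraint behind this suggestion: in pipeline parallelism the per-device batch is consumed as micro_batch_size × accumulate_steps, and evaluation cannot drop the last (possibly ragged) batch, so setting accumulate_steps to the incoming batch size only divides cleanly when micro_batch_size is 1. A sketch of that arithmetic, with an illustrative helper name rather than Paddle's API:

```python
def eval_accumulate_steps(batch_size, micro_batch_size):
    # Pipeline eval must process the whole batch as micro-batches; a ragged
    # final batch only splits evenly if micro_batch_size divides it.
    if batch_size % micro_batch_size != 0:
        raise ValueError("last eval batch does not divide into micro-batches")
    return batch_size // micro_batch_size

# A ragged final eval batch of 7 samples:
print(eval_accumulate_steps(7, 1))  # 7: micro_batch_size=1 always works
try:
    eval_accumulate_steps(7, 2)
except ValueError as e:
    print(e)  # micro_batch_size=2 cannot cover 7 samples evenly
```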
…nto support-pp-acc Conflicts: paddlenlp/trainer/trainer.py
@@ -81,6 +81,7 @@
"fp16_opt_level": "O2",
"max_grad_norm": 1.0,
"dataloader_num_workers": 0,
"metric_for_best_model": "accuracy",
This will also be adapted for the open-source models later.
LGTM
lgtm
LGTM
* support pp accuracy calculation
* add pp accuracy ci
* add comment
* update
* mv logits accumulation to cpu
* refactor code
* code refactor
* remove ci, not support yet
* update
PR types
PR changes
Description