
support pp accuracy calculation #9379

Merged: 11 commits into PaddlePaddle:develop on Nov 29, 2024
Conversation

@wtmlon (Collaborator) commented Nov 6, 2024

PR types

PR changes

Description

@wtmlon wtmlon requested a review from DesmonDay November 6, 2024 11:23

paddle-bot bot commented Nov 6, 2024

Thanks for your contribution!

@CLAassistant commented Nov 6, 2024

CLA assistant check
All committers have signed the CLA.

if pp_group.nranks > 1:
    logit_shape = [[]]
    if "pp_logits" in infohub:
        logits = paddle.concat(infohub["pp_logits"], axis=0)
Contributor

Why is a concat used here? I don't quite understand.


codecov bot commented Nov 12, 2024

Codecov Report

Attention: Patch coverage is 0% with 2 lines in your changes missing coverage. Please review.

Project coverage is 53.10%. Comparing base (4b02477) to head (c0645e7).
Report is 13 commits behind head on develop.

Files with missing lines Patch % Lines
paddlenlp/trainer/trainer.py 0.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9379      +/-   ##
===========================================
+ Coverage    52.93%   53.10%   +0.17%     
===========================================
  Files          688      694       +6     
  Lines       109379   110966    +1587     
===========================================
+ Hits         57899    58930    +1031     
- Misses       51480    52036     +556     


# evaluation doesn't support drop_last,
# so set `accumulate_steps` to the actual
# eval batch size.
model_config_backup = model.accumulate_steps
Collaborator

Isn't this naming a bit off? This is clearly not a model config.


logits = None
if "pp_logits" in infohub:
    logits = paddle.concat(infohub["pp_logits"], axis=0)
    logits = logits._copy_to(paddle.framework._current_expected_place(), False)
Collaborator

Is the copy here because pp_logits is kept in CPU memory or CUDA pinned memory?

Collaborator Author

Yes. If the logits were not kept in CPU or pinned memory here, the concat would add peak GPU memory of roughly twice the logits size and cause an OOM.
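
A minimal sketch of the staging pattern described above, assuming each micro-batch's logits are copied to pinned host memory before being appended to infohub["pp_logits"]; the accumulate_pp_logits/gather_pp_logits helper names are illustrative, not the PR's actual functions:

# Illustrative only: stage per-micro-batch logits in pinned host memory so the
# concat does not double the logits footprint on the GPU.
import paddle

infohub = {}

def accumulate_pp_logits(logits):
    # Copy each micro-batch's logits to CUDA pinned memory before storing them;
    # device memory then only holds one micro-batch of logits at a time.
    infohub.setdefault("pp_logits", []).append(
        logits._copy_to(paddle.CUDAPinnedPlace(), False)
    )

def gather_pp_logits():
    # Concatenate on host memory, then copy the single result tensor back to the
    # current device, keeping peak GPU usage near one full batch of logits.
    logits = paddle.concat(infohub["pp_logits"], axis=0)
    return logits._copy_to(paddle.framework._current_expected_place(), False)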

@ZHUI ZHUI self-requested a review November 13, 2024 07:20
@@ -3312,6 +3347,8 @@ def prediction_step(
        if self.args.pipeline_parallel_degree > 1:
            # hack for pipeline mode
            inputs = self._prepare_inputs(inputs)
            if self.args.metric_for_best_model == "accuracy":
Contributor

I'd suggest not putting this in the trainer; putting it in SFTTrainer makes more sense.

# evaluation doesn't support drop_last,
# so set `accumulate_steps` to the actual
# eval batch size.
model_config_backup = model.accumulate_steps

else:
    input_ids = inputs

model.accumulate_steps = input_ids.shape[0]
Contributor

Alternatively, just set model.micro_batch_size directly to 1.
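
A minimal sketch of the backup/restore pattern the diff implies, assuming a pipeline-parallel model object that exposes accumulate_steps; run_eval_forward is a hypothetical placeholder for the actual eval step:

# Illustrative only: temporarily match accumulate_steps to the received eval
# batch (which is not padded or dropped to a fixed size), then restore it.
accumulate_steps_backup = model.accumulate_steps
try:
    model.accumulate_steps = input_ids.shape[0]  # micro-steps = actual eval batch size
    loss, logits = run_eval_forward(model, input_ids)  # hypothetical eval call
finally:
    # Restore the training-time value so later train/eval steps are unaffected.
    model.accumulate_steps = accumulate_steps_backup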

@@ -81,6 +81,7 @@
    "fp16_opt_level": "O2",
    "max_grad_norm": 1.0,
    "dataloader_num_workers": 0,
    "metric_for_best_model": "accuracy",
Contributor

Follow up later by adapting this for the open-source models as well.

lugimzzz previously approved these changes Nov 27, 2024

@lugimzzz (Contributor) left a comment

LGTM

@lugimzzz (Contributor) left a comment

lgtm

@wawltor (Collaborator) left a comment

LGTM

@wawltor merged commit 741785a into PaddlePaddle:develop on Nov 29, 2024
9 of 12 checks passed
wtmlon added a commit to wtmlon/PaddleNLP that referenced this pull request Nov 29, 2024
* support pp accuracy calculation

* add pp accuracy ci

* add comment

* update

* mv logits accumulation to cpu

* refactor code

* code refactor

* remove ci, not support yet

* update