[NewComm] No.10 compatible upgrade for distributed_fused_lamb op #57424

Merged (3 commits) on Sep 21, 2023

Conversation

BeingGod (Contributor)

PR types

Others

PR changes

APIs

Description

compatible upgrade for distributed_fused_lamb op
#57102

@BeingGod (Contributor, Author)

The unit test currently has a problem: distributed_fused_lamb_test_base.py uses fleet.init to initialize the parallel environment, which causes comm_context_manager to fail to find the ring_id. Could you take a look at whether the new communication library's initialization should be added to fleet.init, or whether the unit test should be modified instead? @GhostScreaming

@paddle-bot paddle-bot bot added the contributor External developers label Sep 17, 2023
@luotao1 luotao1 added the HappyOpenSource 快乐开源活动issue与PR label Sep 18, 2023
@GhostScreaming (Contributor)

Switch to a different initialization method. If the new communication library is used, initialize with paddle.distributed.collective._init_parallel_env("nccl").

@@ -270,7 +270,10 @@ def setUpClass(cls):
     paddle.enable_static()
     paddle.set_flags({'FLAGS_cudnn_deterministic': True})
     _clip_by_global_norm_using_mp_type(True)
-    fleet.init(role_maker=get_role_maker())
+    if os.environ.get("FLAGS_dynamic_static_unified_comm") == "1":
+        fleet.init(role_maker=get_role_maker())
Contributor

The condition looks reversed: when FLAGS_dynamic_static_unified_comm = 1 is set, initialization should use paddle.distributed.collective._init_parallel_env("nccl").

Contributor Author

done
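The branching agreed on in this thread can be sketched as a small pure-Python helper. The helper name and its string-returning shape are illustrative only, chosen so the decision logic is testable without a GPU or paddle installed; the real test calls fleet.init or paddle.distributed.collective._init_parallel_env directly in setUpClass:

```python
import os


def select_comm_init(env=None):
    """Hypothetical helper illustrating the reviewed logic: when
    FLAGS_dynamic_static_unified_comm is "1", the new communication
    library should be initialized via _init_parallel_env("nccl");
    otherwise the legacy fleet.init path is used. Returns the call
    to make as a string for easy inspection."""
    env = os.environ if env is None else env
    if env.get("FLAGS_dynamic_static_unified_comm") == "1":
        return 'paddle.distributed.collective._init_parallel_env("nccl")'
    return "fleet.init(role_maker=get_role_maker())"
```

Gating on the environment flag lets the same test base file drive both the legacy and the unified communication stacks without duplicating the setup code.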

@@ -228,5 +231,19 @@ void NCCLCommContext::GroupStart() {
}
void NCCLCommContext::GroupEnd() { NCCL_CHECK(phi::dynload::ncclGroupEnd()); }

#if NCCL_VERSION_CODE >= 21100
Contributor

Please add a comment here explaining what this function does; it is hard to work out the functionality from the name alone. You could attach the link https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/ops.html and fold its explanation of the op's behavior into the comment.

Contributor Author

done

@BeingGod force-pushed the comm_distributed_fused_lamb branch from 4f2badd to 8a29c88 on September 20, 2023 at 04:38
@hitywt left a comment

LGTM

@GhostScreaming (Contributor) left a comment

LGTM

@luotao1 luotao1 merged commit 3fd69fa into PaddlePaddle:develop Sep 21, 2023
iosmers pushed a commit to iosmers/Paddle that referenced this pull request Sep 21, 2023
…ddlePaddle#57424)

* [NewComm] No.10 compatiable upgrade for distributed_fused_lamb op

* fix
@BeingGod BeingGod deleted the comm_distributed_fused_lamb branch September 25, 2023 11:39
Frida-a pushed a commit to Frida-a/Paddle that referenced this pull request Oct 14, 2023
jiahy0825 pushed a commit to jiahy0825/Paddle that referenced this pull request Oct 16, 2023
danleifeng pushed a commit to danleifeng/Paddle that referenced this pull request Nov 14, 2023
Labels: contributor External developers, HappyOpenSource 快乐开源活动issue与PR

4 participants