[PIR] Registering distributed operators under PIR #60436

Closed
xingmingyyj opened this issue Dec 28, 2023 · 11 comments
xingmingyyj commented Dec 28, 2023

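1. Background

Paddle is building a new IR system. Under the new IR, Paddle generates its operators from the more standardized dynamic-graph operator definitions (ops.yaml, legacy_ops.yaml). The new IR system still has to stay compatible with the old IR. For this purpose Paddle provides ProgramTranslator (code located in paddle/fluid/ir_adaptor/translator/), which translates a computation graph expressed in the old IR into a computation graph in the new IR. At present, the core job of ProgramTranslator is translating individual ops, that is, translating an op defined under the old IR (usually in the paddle/fluid/operators folder) into an operator defined under the new IR.

Some distributed operators currently have no definition under the new IR. We need to add definitions for them under the new IR and make sure ProgramTranslator can translate them successfully.

The distributed operators that need to be registered are listed below: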

No.  Unit test                    Claimed by / status / PR
1    push_sparse_v2               @enkilee #60473
2    distributed_push_sparse      @enkilee #60805
3    c_allreduce_min              @enkilee #60584
4    global_scatter               @xiaoyewww #62579
5    partial_allgather            @xiaoyewww #62735
6    c_scatter                    @DrRyanHuang; @enkilee #62369
7    c_reduce_prod                @DrRyanHuang; @enkilee #62270
8    dgc                          @xiaoyewww #62781
9    partial_recv                 @enkilee #62412
10   pull_gpups_sparse            @xiaoyewww #62935
11   dgc_momentum                 @xiaoyewww #63013
12   all_reduce                   @xiaoyewww #62634
13   partial_send                 @Difers #60484
14   send_and_recv                @Difers #62589; @xiaoyewww #64203
15   push_dense                   @Difers; @enkilee #62505
16   c_split                      @DrRyanHuang; @enkilee #62416
17   barrier                      @xiaoyewww #62802
18   lars_momentum                @enkilee #60838
19   pull_box_sparse              @LittleNoob2333; @enkilee #62982
20   global_gather                @Eacient; @xingmingyyj #63867
21   c_allreduce_prod             @enkilee #60790
22   pull_sparse_v2               @xiaoyewww #63014
23   c_reduce_max                 @enkilee #62270
24   distributed_lookup_table     @xiaoyewww #60911
25   distributed_fused_lamb_init  @xiaoyewww #62050
26   limit_by_capacity            @xiaoyewww #62579
27   distributed_fused_lamb       @enkilee #61293
28   random_routing               @xiaoyewww #62443 #62781
29   prune_gate_by_capacity       @xiaoyewww #62494
30   nop                          @xiaoyewww #62541

PR submission template

  • PR title
【PIR Dist Op Reg No.1】 reg c_reduce_min
  • PR description
### PR types
Others

### PR changes
Others

### Description


Register operator `c_reduce_min`

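How to claim a task

Please claim tasks by leaving a comment, for example:

[Sign up]: 1、3、12-13

Separate multiple tasks with the Chinese enumeration comma (、). A run of consecutive tasks can be written with a hyphen, e.g. 2-5.
PR format: start the PR title with 【PIR Dist Op Reg No.xxx】, as in the template above, and state the task number.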
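
The sign-up format above can be expanded into individual task numbers mechanically. The following is a minimal sketch, not part of the original issue: the function name `parse_claim` is hypothetical, and it assumes an optional leading tag in square or lenticular brackets followed by an optional colon.

```python
import re


def parse_claim(comment: str) -> list[int]:
    """Return the task numbers named in a sign-up comment such as '1、3、12-13'."""
    # Drop an optional leading tag such as [Sign up] or 【报名】 and an optional colon.
    body = re.sub(r"^\s*[【\[][^】\]]*[】\]]\s*[::]?\s*", "", comment.strip())
    tasks: list[int] = []
    for part in body.split("、"):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A hyphenated range such as 12-13 is inclusive on both ends.
            start, end = (int(p) for p in part.split("-", 1))
            tasks.extend(range(start, end + 1))
        else:
            tasks.append(int(part))
    return tasks


print(parse_claim("[Sign up]: 1、3、12-13"))  # [1, 3, 12, 13]
```

A helper like this could be used to cross-check sign-up comments against the task table, though the issue itself tracks claims by hand.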

Dashboard

Track                         Tasks  Submitted / claimed  Submission rate  Completed  Completion rate
Happy Open Source (快乐开源)  30     29 / 29              96.67%           29         96.67%

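2. Tutorial

The main work for each task can be divided into

  • Registering the operator
  • Writing a unit test
  • Modifying test/ir/pir/translator/CMakeLists.txt

three parts, which are described in turn below:

2.1 Operator registration

For the operator registration steps, see section 2 (Tutorial) of #59382.

2.2 Writing the unit test

To verify that the newly registered distributed operator can be translated successfully, we need to write a unit test.

First, every unit test must be placed in the test/ir/pir/translator folder and inherit from either TestOpTranslator or TestOpWithBackwardTranslator. A test that only registers a forward operator should inherit from TestOpTranslator; a test that registers both the forward and the backward operator should inherit from TestOpWithBackwardTranslator.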

# Imports inferred from the names used below.
import unittest

import paddle
from paddle import pir
from paddle.base import core

paddle.enable_static()


class TestOpTranslator(unittest.TestCase):
    def setUp(self):
        self.place = core.Place()
        self.place.set_place(paddle.CPUPlace())
        self.new_scope = paddle.static.Scope()
        self.main_program = paddle.static.Program()

    def append_op(self):
        # Subclasses override this to add the op under test and set self.op_type.
        raise Exception("Define the op to be tested here!")

    def build_model(self):
        with paddle.static.scope_guard(self.new_scope):
            with paddle.static.program_guard(self.main_program):
                self.append_op()

    def check(self):
        self.build_model()
        # Translate the old-IR program, then look for the op name in its printed form.
        pir_program = pir.translate_to_pir(self.main_program.desc)
        assert hasattr(self, "op_type"), "Op_type should be specified!"
        assert self.op_type in str(pir_program), (
            self.op_type
            + " should be translated to pd_op."
            + self.op_type
            + '!'
        )

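When inheriting from TestOpTranslator, you need to override the append_op method so that the op under test is added while the network is built. The idea behind check is to translate the computation graph expressed in the old IR into the new IR with ProgramTranslator and then print the new-IR graph; if the printed graph contains the op being registered, the translation succeeded.
Class names uniformly take the form TestXXXOpTranslator, for example: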

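# Imports inferred from the names used below.
import unittest

import test_op_transcriber

import paddle
from paddle.base.layer_helper import LayerHelper

paddle.enable_static()


class TestCReduceMinOpTranslator(test_op_transcriber.TestOpTranslator):
    def append_op(self):
        self.op_type = "c_reduce_min"
        x = paddle.ones(shape=(100, 2, 3), dtype='float32')
        y = paddle.ones(shape=(100, 2, 3), dtype='float32')
        attrs = {'ring_id': 0, 'root_id': 0, 'use_calc_stream': False}
        # Append the old-IR op directly so that the translator has to handle it.
        helper = LayerHelper(self.op_type)
        helper.append_op(
            type=self.op_type,
            inputs={"X": x},
            outputs={"Out": y},
            attrs=attrs,
        )

    def test_translator(self):
        self.check()


if __name__ == "__main__":
    unittest.main()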

The code above is an example of a test for c_reduce_min.
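
Note that check uses a plain substring test on the printed program, so an op name that is a prefix of another op name (for example dgc and dgc_momentum from the task list) would also match the longer name. The sketch below illustrates a stricter whole-name test in plain Python. It is only an illustration under stated assumptions: the helper `op_in_program_text` is hypothetical, the sample text is made up rather than real translator output, and the only thing taken from the source is the `pd_op.` prefix mentioned in the assertion message.

```python
import re


def op_in_program_text(op_type: str, program_text: str) -> bool:
    """Return True if `pd_op.<op_type>` occurs as a whole operator name."""
    # The negative lookahead rejects longer names that merely start with op_type.
    pattern = r"pd_op\." + re.escape(op_type) + r"(?![A-Za-z0-9_])"
    return re.search(pattern, program_text) is not None


# Hypothetical printed program text, made up for illustration only.
sample = '(%1) = "pd_op.dgc_momentum" (%0) {}'

plain_match = "dgc" in sample                               # substring test
strict_match = op_in_program_text("dgc", sample)            # whole-name test
strict_full = op_in_program_text("dgc_momentum", sample)    # exact op name
print(plain_match, strict_match, strict_full)
```

Whether a stricter test is desirable depends on how the translator names the resulting op (a name carrying an extra suffix would be rejected here), so this is not a drop-in replacement for check.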

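2.3 Modifying test/ir/pir/translator/CMakeLists.txt

Because the operators being registered are distributed operators, they are not compiled or registered when the WITH_DISTRIBUTE build option is off. So even after completing the steps above, you may still run into the following error on some CI pipelines:

ValueError: Operator "xxx" has not been registered.

The fix is to modify CMakeLists.txt: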

file(
  GLOB TEST_INTERP_CASES
  RELATIVE "${CMAKE_CURRENT_SOURCE_DIR}"
  "test_*.py")
string(REPLACE ".py" "" TEST_INTERP_CASES "${TEST_INTERP_CASES}")

set(DISTRIBUTED_OP_TRANSLATOR_TEST test_c_reduce_min_translator)

if(NOT WITH_DISTRIBUTE)
  list(REMOVE_ITEM TEST_INTERP_CASES ${DISTRIBUTED_OP_TRANSLATOR_TEST})
endif()

foreach(target ${TEST_INTERP_CASES})
  py_test_modules(${target} MODULES ${target})
endforeach()

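As shown above, DISTRIBUTED_OP_TRANSLATOR_TEST records the unit tests that belong to distributed operators. When the WITH_DISTRIBUTE option is off, these tests are removed from TEST_INTERP_CASES, so CI does not run them.
Taking the c_allreduce_min operator as an example, its unit test is named test_c_allreduce_min_translator, so: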

set(DISTRIBUTED_OP_TRANSLATOR_TEST test_c_reduce_min_translator
                                   test_c_allreduce_min_translator)

Simply add the corresponding test name to the set.

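3. Q&A

1. Where should the backward operator be defined?

A: It depends on where the forward operator is defined. If the forward operator is defined in paddle/phi/api/yaml/ops.yaml, the backward operator must be defined in paddle/phi/api/yaml/backward.yaml. If the forward operator is defined in paddle/fluid/pir/dialect/operator/ir/ops.yaml, define the backward operator in paddle/fluid/pir/dialect/operator/ir/ops_backward.yaml.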
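
The rule in this answer amounts to a fixed lookup from the forward YAML file to the backward one. A minimal sketch follows, using only the four paths named above; the names `BACKWARD_YAML_FOR` and `backward_yaml` are hypothetical.

```python
# Forward-definition file -> backward-definition file, as stated in the answer above.
BACKWARD_YAML_FOR = {
    "paddle/phi/api/yaml/ops.yaml": "paddle/phi/api/yaml/backward.yaml",
    "paddle/fluid/pir/dialect/operator/ir/ops.yaml": (
        "paddle/fluid/pir/dialect/operator/ir/ops_backward.yaml"
    ),
}


def backward_yaml(forward_yaml: str) -> str:
    """Return the YAML file where the matching backward op should be defined."""
    try:
        return BACKWARD_YAML_FOR[forward_yaml]
    except KeyError:
        raise ValueError(f"no backward location known for {forward_yaml!r}") from None


print(backward_yaml("paddle/phi/api/yaml/ops.yaml"))
```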

Statistics

In no particular order: @enkilee (12) @xiaoyewww (15) @Difers (1) @xingmingyyj (1)

@DrRyanHuang
Member

[Sign up]: 6、7、16

@enkilee
Contributor

enkilee commented Dec 28, 2023

[Sign up]: 1、3

@xiaoyewww
Contributor

[Sign up]: 24、25

@paddle-bot paddle-bot bot added the PFCC Paddle Framework Contributor Club,https://github.com/PaddlePaddle/community/tree/master/pfcc label Dec 28, 2023
@Difers
Contributor

Difers commented Dec 30, 2023

[Sign up]: 13、14、15

@xiaoyewww
Contributor

[Sign up] 12、17、26

@PaddlePaddle PaddlePaddle deleted a comment from sanbuphy Mar 8, 2024
@xiaoyewww
Contributor

[Sign up] 4、5、8、10、11

@Eacient

Eacient commented Mar 18, 2024

[Sign up]: 20

@LittleNoob2333

[Sign up]: 19

@xiaoyewww
Contributor

[Sign up]: 22

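@luotao1
Contributor

luotao1 commented May 11, 2024

[PIR] Registering distributed operators under PIR is now fully complete. Thanks to everyone who took part!

In no particular order: @enkilee (12) @xiaoyewww (15) @Difers (1) @xingmingyyj (1)

You are welcome to keep taking part in other Happy Open Source (快乐开源) tasks.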

@luotao1 luotao1 closed this as completed May 11, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Call for Contributions May 11, 2024
@paddle-bot paddle-bot bot added the status/close 已关闭 label May 11, 2024