[Auto Parallel] Add general gradient merge pass to support auto parallel #38259
Conversation
Thanks for your contribution!
@@ -0,0 +1,349 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Please rename gradient_merge.py to auto_parallel_gradient_merge.py, since this pass may not work for other code.
done
return optimize_ops_desc

def _remove_op_role_var(param, grad):
What is the purpose of _remove_op_role_var?
In the non-auto-parallel case, multi-card training uses the "op_role_var" attribute to record the Vars that need communication (marking which gradients have to be communicated). After gradient merge is added, the op_role_var recorded on the original ops is no longer correct and must be removed; meanwhile, the corresponding optimizer ops need their own op_role_var added so that the merged gradients can be allreduced later.
Adding allreduce based on op_role_var is ParallelExecutor (PE) logic. Auto parallel does not go through PE; instead, each dist op decides on its own whether its gradient needs synchronization and which data-parallel world to synchronize in (there may be multiple dp worlds). op_role_var has no effect in auto parallel (because it cannot distinguish between multiple dp worlds), so there is no need to add it at all.
The op_role_var attribute has been removed from the program.
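For reference, here is a minimal sketch of what stripping op_role_var from the op that produced a gradient can look like; the attribute name and helper calls follow Paddle's existing GradientMergeOptimizer, but treat the exact signatures as assumptions rather than this PR's code.

```python
from paddle.fluid import core

def _remove_op_role_var(param, grad):
    # kOpRoleVarAttrName() resolves to "op_role_var"; it lists the
    # (param, grad) pairs that the executor would allreduce. After gradient
    # merge the raw grad is no longer the tensor to communicate, so the
    # stale attribute is dropped from the backward op that produced it.
    op_maker = core.op_proto_and_checker_maker
    op = grad.op
    if op is not None and op.has_attr(op_maker.kOpRoleVarAttrName()):
        op._remove_attr(op_maker.kOpRoleVarAttrName())
```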
_add_gm_op_role_var(new_grad_op, param, gradient_merge_var,
                    cond_var_name)
new_params_grads.append([param, gradient_merge_var])
return new_params_grads, param_to_gradient_merge
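To make the intent of the snippet above easier to follow, below is a framework-free sketch of the gradient-merge bookkeeping: each parameter keeps a persistable accumulator (playing the role of gradient_merge_var), micro-batch gradients are summed into it, and the optimizer only consumes the accumulated value every k_steps steps. The names and plain-Python form are illustrative assumptions, not the pass's actual code.

```python
def accumulate_gradients(params_grads, accumulators, step, k_steps, avg=True):
    """params_grads: list of (param_name, grad_value) for one micro-batch.

    Returns the (param, merged_grad) pairs the optimizer should see and a
    flag telling whether this step should actually apply the update.
    """
    new_params_to_grads = []
    for param, grad in params_grads:
        # grad@GradientMerge += grad  (the accumulator starts at zero)
        accumulators[param] = accumulators.get(param, 0.0) + grad
        new_params_to_grads.append((param, accumulators[param]))

    apply_update = (step + 1) % k_steps == 0
    if apply_update and avg:
        # average over the merged micro-batches before the optimizer runs
        new_params_to_grads = [(p, g / k_steps) for p, g in new_params_to_grads]
    return new_params_to_grads, apply_update


# Usage: accumulate over 4 micro-batches, then update and reset.
accs = {}
for step in range(8):
    merged, do_update = accumulate_gradients([("w", 1.0)], accs, step, k_steps=4)
    if do_update:
        # optimizer.apply_gradients(merged) would run here
        accs.clear()
```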
It is better to rename new_params_grads to new_params_to_grads, in the same style as param_to_gradient_merge, to explicitly indicate a dict.
done
LGTM
PR types
New features
PR changes
Others
Describe
[Auto Parallel] add gradient merge pass
Refer to https://github.com/xymyeah/gradient_merge_precision_alignment for the results of precision alignment.
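For context, gradient merge in Paddle's static-graph training is usually requested through the fleet distributed strategy; the sketch below uses the existing gradient-merge options of paddle.distributed.fleet.DistributedStrategy as an assumption of how a user would enable the behaviour this pass implements, and the auto-parallel wiring in this PR may expose it differently.

```python
import paddle.distributed.fleet as fleet

strategy = fleet.DistributedStrategy()
# Accumulate gradients over 4 micro-batches and average them before the
# optimizer update, giving an effective batch size of 4x the micro-batch.
strategy.gradient_merge = True
strategy.gradient_merge_configs = {"k_steps": 4, "avg": True}
```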