
[Auto Parallel] Add general gradient merge pass to support auto parallel #38259


Merged: 19 commits into PaddlePaddle:develop on Dec 31, 2021

Conversation

@xymyeah (Contributor) commented Dec 18, 2021

PR types

New features

PR changes

Others

Describe

[Auto Parallel] add gradient merge pass
For the precision alignment results, refer to: https://github.com/xymyeah/gradient_merge_precision_alignment
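
For context, gradient merge accumulates the gradients of several micro-batches and applies the optimizer update only once every k steps. The sketch below only illustrates this idea; the names used here (gradient_merge_step, merged_grads, apply_optimizer) are hypothetical and are not part of this pass's API.

```python
# Minimal sketch of gradient merge (k-step gradient accumulation).
# Gradients from k micro-batches are summed into persistent buffers,
# and the optimizer update runs only on every k-th step.

def gradient_merge_step(step, k_steps, grads, merged_grads, apply_optimizer):
    """Accumulate `grads` into `merged_grads`; update parameters every k_steps."""
    for name, g in grads.items():
        merged_grads[name] = merged_grads.get(name, 0.0) + g
    if (step + 1) % k_steps == 0:
        # Average the accumulated gradients before the optimizer update,
        # then reset the buffers for the next k steps.
        averaged = {name: g / k_steps for name, g in merged_grads.items()}
        apply_optimizer(averaged)
        merged_grads.clear()
```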

@paddle-bot-old commented:

Thanks for your contribution!
Please wait for the result of CI first. See the Paddle CI Manual for details.

@@ -0,0 +1,349 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
Contributor:

Please rename gradient_merge.py to auto_parallel_gradient_merge.py since this pass may not work for other codes.

Contributor Author:

done

return optimize_ops_desc


def _remove_op_role_var(param, grad):
Contributor:

What is the purpose of _remove_op_role_var?

@xymyeah (Contributor Author) commented Dec 29, 2021:

In the non-auto-parallel case, multi-card training uses the "op_role_var" attribute to record the Vars to be communicated (i.e., to mark which gradients need communication). After grad merge is added, the op_role_var recorded in the original ops becomes incorrect and needs to be removed; at the same time, the corresponding op_role_var needs to be added to the matching optimizer ops, so that the grads can later be allreduced for the merge.
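
To make the above concrete, a helper along these lines could drop the stale attribute from the op that produced the grad. This is a minimal sketch assuming Paddle's core.op_proto_and_checker_maker and Operator._remove_attr APIs; it is not necessarily the exact code in this PR.

```python
from paddle.fluid import core

def _remove_op_role_var(param, grad):
    # After gradient merge, the op_role_var recorded on the op that produced
    # `grad` no longer reflects the variables that actually need communication,
    # so the stale attribute is removed here.
    # (`param` is kept only to mirror the signature quoted above.)
    op_maker = core.op_proto_and_checker_maker
    op = grad.op
    if op is not None and op.has_attr(op_maker.kOpRoleVarAttrName()):
        op._remove_attr(op_maker.kOpRoleVarAttrName())
```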

Contributor:

Adding allreduce based on op_role_var is ParallelExecutor (PE) logic. Auto parallel does not go through PE; each dist op decides on its own whether gradient synchronization is needed and which dp world to synchronize in (there may be multiple dp worlds). op_role_var does not take effect in auto parallel (because it cannot distinguish between multiple dp worlds), so it does not need to be added at all.

Contributor Author:

The op_role_var attribute has been removed from the program.

aoyulong previously approved these changes Dec 28, 2021
_add_gm_op_role_var(new_grad_op, param, gradient_merge_var,
cond_var_name)
new_params_grads.append([param, gradient_merge_var])
return new_params_grads, param_to_gradient_merge
Contributor:

It would be better to rename new_params_grads to new_params_to_grads, consistent with param_to_gradient_merge, to explicitly indicate the param-to-grad mapping.

Contributor Author:

done

JZ-LIANG previously approved these changes Dec 30, 2021
@JZ-LIANG (Contributor) left a comment:

LGTM

@xymyeah xymyeah changed the title [Auto Parallel] add gradient merge pass [Auto Parallel] Add general gradient merge pass to support auot parallel Dec 30, 2021
@xymyeah xymyeah changed the title [Auto Parallel] Add general gradient merge pass to support auot parallel [Auto Parallel] Add general gradient merge pass to support auto parallel Dec 31, 2021
@JZ-LIANG JZ-LIANG merged commit 89ce6db into PaddlePaddle:develop Dec 31, 2021