[Auto-parallel] Improve stage support for inter-layer passing of stop_grad=True parameters #73459
Conversation
Your PR has been submitted successfully. Thank you for contributing to the open-source project!
# We assume we always send to stage + 1
if not self.is_last:
    self.act_send_info[idx] = [self.stage_index + 1]
    if not outputs_meta[idx].stop_gradient:
TensorMeta does not have this attribute.
Done
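For reference, a rough sketch of the kind of check discussed in this thread, not the actual change in this PR: the stop_gradient flag is read from the real output tensors (TensorMeta only describes shape and dtype), and the helper name, arguments, and return layout below are all hypothetical.

import paddle

def build_send_info(output_tensors, stage_index, is_last):
    # Illustrative only: every output is still sent to stage + 1, but only
    # outputs with stop_gradient == False are marked as expecting a gradient
    # back during backward. The flag comes from the real tensor, not TensorMeta.
    act_send_info = {}
    expects_grad = {}
    for idx, out in enumerate(output_tensors):
        act_send_info[idx] = [] if is_last else [stage_index + 1]
        expects_grad[idx] = (
            not is_last
            and isinstance(out, paddle.Tensor)
            and not out.stop_gradient
        )
    return act_send_info, expects_grad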
Sorry to inform you that ea80cdb's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files

@@            Coverage Diff             @@
##           develop     #73459   +/-   ##
===========================================
  Coverage         ?   100.00%
===========================================
  Files            ?         2
  Lines            ?        12
  Branches         ?         0
===========================================
  Hits             ?        12
  Misses           ?         0
  Partials         ?         0

☔ View full report in Codecov by Sentry.
@@ -710,9 +709,19 @@ def forward_one_chunk(
flat_args = _flatten_args(input_args)
flat_kwargs = _flatten_args(composite_kwargs)
flatten_input_tensors = flat_args + flat_kwargs
grad_required_output_tuple = tuple(
The naming seems off: the grad prefix suggests these hold gradient data. requires_grad_output_tuple would be a clearer name.
Done
    for out in output_tuple
    if isinstance(out, paddle.Tensor) and not out.stop_gradient
)
grad_required_flatten_input_tensors = [
Same as above.
Done
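Putting the two comments above together, the filtering in forward_one_chunk presumably ends up along these lines; this is a sketch using the suggested renamed variables, wrapped in a standalone helper for illustration rather than copied from the PR.

import paddle

def filter_requires_grad(output_tuple, flatten_input_tensors):
    # Keep only the tensors that take part in backward; stop_gradient tensors
    # are forwarded between layers but excluded from gradient exchange.
    requires_grad_output_tuple = tuple(
        out
        for out in output_tuple
        if isinstance(out, paddle.Tensor) and not out.stop_gradient
    )
    requires_grad_flatten_input_tensors = [
        t
        for t in flatten_input_tensors
        if isinstance(t, paddle.Tensor) and not t.stop_gradient
    ]
    return requires_grad_output_tuple, requires_grad_flatten_input_tensors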
/re-run all-failed
LGTM
PR Category
Auto Parallel
PR Types
Improvements
Description
When use_flash_attention is set to false, running llama2_13b_hybrid_pp fails. The root cause is that the number of outputs the forward pass sends to the next stage does not match the number of gradients received from the next stage during backward, so the backward pass cannot be computed correctly.
The tensors computed in EmbeddingLayer need to be passed along throughout the forward pass and used in the computations of each DecoderLayer. Note, however, that apart from hidden_states, the other tensors are only passed between layers after EmbeddingLayer computes them and serve purely as auxiliary inputs.
When collecting each layer's outputs, filter out the tensors whose stop_grad is True and adapt the related code so that they are only passed between layers and excluded from backward.
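A minimal toy illustration of the scenario described above, with made-up tensors standing in for the real EmbeddingLayer/DecoderLayer outputs: the auxiliary stop_gradient tensor travels with the forward outputs, but only the grad-requiring tensors are registered for backward, so output and gradient counts match.

import paddle

# hidden_states participates in backward; position_ids is auxiliary and
# forward-only (stop_gradient=True), mirroring the description above.
hidden_states = paddle.randn([2, 8, 16])
hidden_states.stop_gradient = False
position_ids = paddle.arange(8).reshape([1, 8])
position_ids.stop_gradient = True

stage_outputs = (hidden_states, position_ids)

# Both tensors are sent to the next stage, but only the grad-requiring ones
# are registered for backward, so the number of gradients received later
# matches the number of filtered outputs.
backward_outputs = tuple(
    out for out in stage_outputs
    if isinstance(out, paddle.Tensor) and not out.stop_gradient
)
assert len(backward_outputs) == 1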