
Support capture backward subgraph #76032

Merged
DanielSun11 merged 8 commits into PaddlePaddle:develop from DanielSun11:backward_graph_debug
Oct 30, 2025


DanielSun11 commented Oct 24, 2025

PR Category

User Experience

PR Types

New features

Description

Adds support for capturing a subgraph of the backward graph.

Usage:

import paddle
from paddle.base.framework import capture_backward_subgraph_guard
x = paddle.randn([2,3],dtype="float32")
y = paddle.randn([2,3],dtype="float64")
x.stop_gradient = False
y.stop_gradient = False
z1 = x - y
z2 = x + y
# Capture only the backward subgraph that corresponds to the forward ops inside the guard, and also dump the Grads in that subgraph
with capture_backward_subgraph_guard(dump_dir_path="./debug", need_dump_grad_tensors=True):
    z3 = z1 * z2
    z4 = z3.sum()
    z5 = z2.mean()
loss = z4 + z5
loss.sum().backward()
print(x.grad)

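dump_dir_path: the directory to which the backward graph, forward graph, call stacks, and all gradient Tensors are written.
need_dump_grad_tensors: whether to dump the gradients (the edges of the backward graph).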
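
The gradients that should flow through the guarded region of the usage example can be worked out by hand: with z3 = z1 * z2, z4 = z3.sum(), z5 = z2.mean() and loss = z4 + z5, z3 receives all ones, z1 receives z2, and z2 receives two contributions (z1 from the multiply and 1/n from the mean). The following is a minimal pure-Python sketch of that arithmetic on made-up inputs. It does not call Paddle, and the helper name `guarded_region_grads` is hypothetical.

```python
def guarded_region_grads(x, y):
    """Hand-derived gradients for z3 = z1 * z2, z4 = sum(z3), z5 = mean(z2).

    x and y are flat lists of floats standing in for the tensors of the
    usage example; z1 = x - y and z2 = x + y are computed before the guard.
    """
    n = len(x)
    z1 = [a - b for a, b in zip(x, y)]
    z2 = [a + b for a, b in zip(x, y)]

    grad_z3 = [1.0] * n                # sum() passes a gradient of ones to z3
    grad_z2_from_mean = [1.0 / n] * n  # mean() spreads 1/n over z2
    grad_z1 = list(z2)                 # product rule: d(z1*z2)/d(z1) = z2
    grad_z2_from_mul = list(z1)        # product rule: d(z1*z2)/d(z2) = z1

    # z2 feeds two ops, so its two incoming gradients are accumulated.
    grad_z2 = [m + s for m, s in zip(grad_z2_from_mul, grad_z2_from_mean)]

    # Chain back to the leaves through z1 = x - y and z2 = x + y.
    grad_x = [g1 + g2 for g1, g2 in zip(grad_z1, grad_z2)]
    grad_y = [g2 - g1 for g1, g2 in zip(grad_z1, grad_z2)]
    return {
        "z3": grad_z3,
        "z2_from_mean": grad_z2_from_mean,
        "z2_from_mul": grad_z2_from_mul,
        "z1": grad_z1,
        "x": grad_x,
        "y": grad_y,
    }


grads = guarded_region_grads([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(grads["x"], grads["y"])
```

For a 2x3 tensor n is 6, so the mean contribution is 1/6, which is the 0.166667 that appears in the dumped gradient listing in this description.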

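Exported backward subgraph:

  • Gray GradNodes are the nodes inside the backward subgraph.
  • GradNodes highlighted in orange are related to the inputs/outputs of the captured subgraph. They are not part of the subgraph but take part in its gradient propagation (usually they are the subgraph's inputs and outputs).
[image: exported backward subgraph]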

Gradient information of the backward subgraph:

Variable: sum1_out_float64_@Grad
  - lod: {}
  - place: Place(gpu:0)
  - shape: []
  - layout: NCHW
  - dtype: float64
  - data: [1.000000]
Variable: mean1_out_float64_@Grad
  - lod: {}
  - place: Place(gpu:0)
  - shape: []
  - layout: NCHW
  - dtype: float64
  - data: [1.000000]
Variable: multiply1_out_float64_2x3@Grad
  - lod: {}
  - place: Place(gpu:0)
  - shape: [2, 3]
  - layout: NCHW
  - dtype: float64
  - data: [1.000000 1.000000 1.000000 1.000000 1.000000 1.000000]
Variable: add1_out_float64_2x3@Grad
  - lod: {}
  - place: Place(gpu:0)
  - shape: [2, 3]
  - layout: NCHW
  - dtype: float64
  - data: [0.166667 0.166667 0.166667 0.166667 0.166667 0.166667]
Variable: subtract1_out_float64_2x3@Grad
  - lod: {}
  - place: Place(gpu:0)
  - shape: [2, 3]
  - layout: NCHW
  - dtype: float64
  - data: [-2.869025 0.629349 -0.607799 -0.698559 1.029199 -3.619529]
Variable: add1_out_float64_2x3@Grad
  - lod: {}
  - place: Place(gpu:0)
  - shape: [2, 3]
  - layout: NCHW
  - dtype: float64
  - data: [1.843187 -2.461634 -0.024330 0.309245 1.282574 1.534915]
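
The dumped text is regular enough to post-process with a small script. The sketch below parses records in the layout shown above (a `Variable:` line followed by `- key: value` lines) into dictionaries. The function name `parse_grad_dump` is hypothetical, and the format is inferred from this sample only, not from a documented specification.

```python
import re

SAMPLE = """\
Variable: mean1_out_float64_@Grad
  - lod: {}
  - place: Place(gpu:0)
  - shape: []
  - layout: NCHW
  - dtype: float64
  - data: [1.000000]
Variable: add1_out_float64_2x3@Grad
  - lod: {}
  - place: Place(gpu:0)
  - shape: [2, 3]
  - layout: NCHW
  - dtype: float64
  - data: [0.166667 0.166667 0.166667 0.166667 0.166667 0.166667]
"""


def parse_grad_dump(text):
    """Parse 'Variable:' records into a list of dicts (assumed format)."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Variable:"):
            current = {"name": line[len("Variable:"):].strip()}
            records.append(current)
        elif line.startswith("- ") and current is not None:
            # Split on the first colon only, so 'Place(gpu:0)' stays intact.
            key, _, value = line[2:].partition(":")
            key, value = key.strip(), value.strip()
            if key == "shape":
                current[key] = [int(d) for d in re.findall(r"-?\d+", value)]
            elif key == "data":
                current[key] = [float(v) for v in value.strip("[]").split()]
            else:
                current[key] = value
    return records


for record in parse_grad_dump(SAMPLE):
    print(record["name"], record["shape"], record["data"][:3])
```

A list is returned rather than a dict keyed by name because the same variable name can appear more than once, as add1_out_float64_2x3@Grad does in the listing above.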

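Notes

  • Once capture_backward_subgraph_guard is used, the dump_backward_graph_path parameter of backward and grad no longer takes effect. In other words, exporting the full backward graph is temporarily disabled after capture_backward_subgraph_guard has been used.
  • capture_backward_subgraph_guard can only be used once per run. Calling the guard twice may scramble the output files (typically everything is written to the directory given in the last guard call).
  • Setting export FLAGS_enable_unique_name=True is recommended: Tensors then get unique names, which makes the gradient information easier to read.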
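
The last two notes can be enforced from a script. What follows is a minimal sketch, assuming Paddle picks up `FLAGS_*` environment variables when it is imported. `single_use` is a hypothetical helper, not part of Paddle, that refuses to enter a wrapped guard a second time in the same process.

```python
import contextlib
import os

# Must run before `import paddle`; assumed to be read at import time.
os.environ.setdefault("FLAGS_enable_unique_name", "True")


def single_use(guard_factory):
    """Wrap a context-manager factory so it can be entered only once."""
    used = False

    @contextlib.contextmanager
    def wrapper(*args, **kwargs):
        nonlocal used
        if used:
            raise RuntimeError("capture guard was already used in this process")
        used = True
        with guard_factory(*args, **kwargs) as value:
            yield value

    return wrapper


# Intended use (requires a Paddle build that contains this PR):
#   from paddle.base.framework import capture_backward_subgraph_guard
#   capture_once = single_use(capture_backward_subgraph_guard)
#   with capture_once(dump_dir_path="./debug", need_dump_grad_tensors=True):
#       ...
```

Failing loudly on a second call is a design choice for this sketch; the PR itself only warns that the output files may end up mixed.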


paddle-bot bot commented Oct 24, 2025

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.


codecov-commenter commented Oct 25, 2025

Codecov Report

❌ Patch coverage is 92.36641% with 10 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@b4a5484). Learn more about missing BASE report.

Files with missing lines          Patch %   Lines
paddle/fluid/eager/utils.cc       87.50%    7 Missing ⚠️
paddle/fluid/eager/backward.cc    94.23%    3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop   #76032   +/-   ##
==========================================
  Coverage           ?   92.36%           
==========================================
  Files              ?        4           
  Lines              ?      131           
  Branches           ?        0           
==========================================
  Hits               ?      121           
  Misses             ?       10           
  Partials           ?        0           
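
The headline figure follows directly from the totals in the table above: 121 hits out of 131 patch lines. The short check below reproduces it. The per-file line counts are inferred from the reported percentages and missing-line counts; they are not stated in the report, and `lines_from` is a hypothetical helper.

```python
hits, lines = 121, 131
misses = lines - hits
patch_coverage = 100 * hits / lines


def lines_from(missing, pct):
    """Infer a file's patch line count from its missing lines and coverage %."""
    return round(missing / (1 - pct / 100))


utils_lines = lines_from(7, 87.50)     # paddle/fluid/eager/utils.cc
backward_lines = lines_from(3, 94.23)  # paddle/fluid/eager/backward.cc

print(f"misses={misses}, patch coverage={patch_coverage:.5f}%")
print(f"inferred lines: utils.cc={utils_lines}, backward.cc={backward_lines}")
```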


// Set for Record Subgraph
if (egr::EagerBackwardSubGraphNodeRecorder::Instance()
        .NeedCaptureSubGraph()) {
  VLOG(3) << "Capture the grad node" << grad_node->name() << "("

VLOG(10) or VLOG(8)

@DanielSun11 DanielSun11 merged commit 087427e into PaddlePaddle:develop Oct 30, 2025
77 of 80 checks passed

5 participants