
[Dy2St][CUDAGraph] Set undefined place for CUDAGraph OP outputs before lowering to avoid unnecessary memcpy && Add CUDAGraph unitest#75078

Merged
DrRyanHuang merged 10 commits into PaddlePaddle:develop from cattidea:cg
Sep 15, 2025

Conversation

@DrRyanHuang
Contributor

@DrRyanHuang DrRyanHuang commented Sep 4, 2025

PR Category

Execute Infrastructure

PR Types

Improvements

Description

The original graph-splitting logic (this description was later revised; see #75078 (comment) below) caused the executor to insert unnecessary `memcpy_h2d` OPs at the end of the subgraph:

    (%55, %56, %57, %58, %59, %60) = "pd_op.cuda_graph" [id:3213] () {} : () -> gpu_tensor<-1x1024xbf16>, gpu_tensor<-1x2560xbf16>, gpu_tensor<1xi64>, gpu_tensor<1xi64>, gpu_tensor<i64>, gpu_tensor<-1x2048xbf16>
    {
		......
        (%65) = "embedding(phi_kernel)" (%0, %64) {kernel_key:<backend:GPU|layout:NCHW|dtype:bfloat16>,kernel_name:"embedding",op_name:"pd_op.embedding",origin_id:3218,padding_idx:-1,sparse:false,stop_gradient:[false]} : (gpu_tensor<-1xi64>, gpu_tensor<103424x1024xbf16>) -> gpu_tensor<-1x1024xbf16>
        (%66, %67, %68) = "rms_norm(phi_kernel)" (%65, <<NULL VALUE>>, <<NULL VALUE>>, %63, <<NULL VALUE>>) {begin_norm_axis:1,epsilon:1e-05,kernel_key:<backend:GPU|layout:NCHW|dtype:bfloat16>,kernel_name:"rms_norm",op_name:"pd_op.rms_norm",origin_id:3219,quant_max_bound:0,quant_min_bound:0,quant_round_type:0,quant_scale:-1,stop_gradient:[false,false,false]} : (gpu_tensor<-1x1024xbf16>, <<NULL TYPE>>, <<NULL TYPE>>, gpu_tensor<1024xbf16>, <<NULL TYPE>>) -> gpu_tensor<-1x1024xbf16>, <<NULL TYPE>>, gpu_tensor<-1xf32>
        (%69) = "weight_only_linear(phi_kernel)" (%66, %62, <<NULL VALUE>>, %61) {arch:90,group_size:-1,kernel_key:<backend:GPU|layout:NCHW|dtype:bfloat16>,kernel_name:"weight_only_linear",op_name:"pd_op.weight_only_linear",origin_id:3220,stop_gradient:[false],weight_dtype:"int8"} : (gpu_tensor<-1x1024xbf16>, gpu_tensor<2560x1024xi8>, <<NULL TYPE>>, gpu_tensor<2560xbf16>) -> gpu_tensor<-1x2560xbf16>
        (%70) = "shape64(phi_kernel)" (%69) {kernel_key:<backend:GPU|layout:NCHW|dtype:bfloat16>,kernel_name:"shape64",op_name:"pd_op.shape64",origin_id:3221,stop_gradient:[true]} : (gpu_tensor<-1x2560xbf16>) -> cpu_tensor<2xi64>
        (%71) = "full_int_array(phi_kernel)" () {dtype:int64,kernel_key:<backend:CPU|layout:Undefined(AnyLayout)|dtype:int64>,kernel_name:"full_int_array",op_name:"pd_op.full_int_array",origin_id:3222,place:Place(cpu),stop_gradient:[true],value:[0]} : () -> cpu_tensor<1xi64>
        (%72) = "full_int_array(phi_kernel)" () {dtype:int64,kernel_key:<backend:CPU|layout:Undefined(AnyLayout)|dtype:int64>,kernel_name:"full_int_array",op_name:"pd_op.full_int_array",origin_id:3223,place:Place(cpu),stop_gradient:[true],value:[1]} : () -> cpu_tensor<1xi64>
        (%73) = "slice(phi_kernel)" (%70, %71, %72) {axes:[0],decrease_axis:[0],infer_flags:[1],kernel_key:<backend:CPU|layout:NCHW|dtype:int64>,kernel_name:"slice",op_name:"pd_op.slice",origin_id:3224,stop_gradient:[true]} : (cpu_tensor<2xi64>, cpu_tensor<1xi64>, cpu_tensor<1xi64>) -> cpu_tensor<i64>
        (%74) = "full(phi_kernel)" () {dtype:int64,kernel_key:<backend:CPU|layout:Undefined(AnyLayout)|dtype:int64>,kernel_name:"full",op_name:"pd_op.full",origin_id:3225,place:Place(cpu),shape:[],stop_gradient:[true],value:2048} : () -> cpu_tensor<i64>
		.....
        (%78) = "memcpy_h2d(phi_kernel)" (%71) {dst_place_type:1,kernel_key:<backend:GPU|layout:Undefined(AnyLayout)|dtype:int64>,kernel_name:"memcpy_h2d",op_name:"pd_op.memcpy_h2d",origin_id:3229} : (cpu_tensor<1xi64>) -> gpu_tensor<1xi64>
        (%79) = "memcpy_h2d(phi_kernel)" (%72) {dst_place_type:1,kernel_key:<backend:GPU|layout:Undefined(AnyLayout)|dtype:int64>,kernel_name:"memcpy_h2d",op_name:"pd_op.memcpy_h2d",origin_id:3230} : (cpu_tensor<1xi64>) -> gpu_tensor<1xi64>
        (%80) = "memcpy_h2d(phi_kernel)" (%74) {dst_place_type:1,kernel_key:<backend:GPU|layout:Undefined(AnyLayout)|dtype:int64>,kernel_name:"memcpy_h2d",op_name:"pd_op.memcpy_h2d",origin_id:3231} : (cpu_tensor<i64>) -> gpu_tensor<i64>
        () = "cf.yield" [id:3232] (%65, %69, %78, %79, %80, %77) {origin_id:3120} : (gpu_tensor<-1x1024xbf16>, gpu_tensor<-1x2560xbf16>, gpu_tensor<1xi64>, gpu_tensor<1xi64>, gpu_tensor<i64>, gpu_tensor<-1x2048xbf16>) -> 
    }

The cause: we externally specified the output place as GPUPlace, so even though the final yield OP outputs include cpu_tensor values, the memcpy_h2d OPs inserted during lowering copy them onto the GPU.

Therefore, before ProcessBlock, we set the place to the default phi::Place, whose AllocationType is UNDEFINED:

class TEST_API Place {
 public:
  Place()
      : device(0), alloc_type_(AllocationType::UNDEFINED), device_type_id_(0) {}

With this, the executor no longer inserts memcpy_h2d OPs into the graph. After ProcessBlock, we simply set the output types of the CUDAGraphOp back:

  for (size_t i = 0; i < yield_op.num_operands(); ++i) {
    new_cg_op->result(i).set_type(yield_op.operand_type(i));
  }
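The overall flow of the fix can be sketched with a minimal mock. All names here (`MockOp`, `LowerCudaGraphOp`, the simplified `Place`/`AllocationType` types) are illustrative stand-ins for Paddle's internals, not its actual API:

```cpp
#include <vector>

// Simplified stand-ins for phi::AllocationType / phi::Place.
enum class AllocationType { UNDEFINED = 0, CPU, GPU };

struct Place {
  // A default-constructed Place is UNDEFINED, so the executor has no
  // "expected place" to enforce and inserts no memcpy_h2d.
  Place() : alloc_type_(AllocationType::UNDEFINED) {}
  explicit Place(AllocationType t) : alloc_type_(t) {}
  AllocationType alloc_type_;
};

// Hypothetical mock of an op whose results carry a place.
struct MockOp {
  std::vector<Place> result_places;
};

// Sketch of the PR's idea: clear the output places before lowering
// (ProcessBlock), then copy the yield operand places back afterwards.
void LowerCudaGraphOp(MockOp& cg_op, const MockOp& yield_op) {
  // 1. Before lowering: mark every output place UNDEFINED.
  for (auto& p : cg_op.result_places) p = Place();
  // 2. ... ProcessBlock would lower the body here; with UNDEFINED
  //    output places no memcpy_h2d OPs are inserted ...
  // 3. After lowering: restore the real places from the yield operands.
  cg_op.result_places = yield_op.result_places;
}
```

The restore step mirrors the `set_type(yield_op.operand_type(i))` loop above: the CUDAGraph OP's result types end up exactly matching the types yielded by its body.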

The final graph is shown below; the memcpy OPs are gone:

    (%55, %56, %57, %58, %59, %60) = "pd_op.cuda_graph" [id:6315] () {} : () -> gpu_tensor<-1x1024xbf16>, gpu_tensor<-1x2560xbf16>, cpu_tensor<1xi64>, cpu_tensor<1xi64>, cpu_tensor<i64>, gpu_tensor<-1x2048xbf16>
    {
		......
        (%65) = "embedding(phi_kernel)" (%0, %64) {kernel_key:<backend:GPU|layout:NCHW|dtype:bfloat16>,kernel_name:"embedding",op_name:"pd_op.embedding",origin_id:6320,padding_idx:-1,sparse:false,stop_gradient:[false]} : (gpu_tensor<-1xi64>, gpu_tensor<103424x1024xbf16>) -> gpu_tensor<-1x1024xbf16>
        (%66, %67, %68) = "rms_norm(phi_kernel)" (%65, <<NULL VALUE>>, <<NULL VALUE>>, %63, <<NULL VALUE>>) {begin_norm_axis:1,epsilon:1e-05,kernel_key:<backend:GPU|layout:NCHW|dtype:bfloat16>,kernel_name:"rms_norm",op_name:"pd_op.rms_norm",origin_id:6321,quant_max_bound:0,quant_min_bound:0,quant_round_type:0,quant_scale:-1,stop_gradient:[false,false,false]} : (gpu_tensor<-1x1024xbf16>, <<NULL TYPE>>, <<NULL TYPE>>, gpu_tensor<1024xbf16>, <<NULL TYPE>>) -> gpu_tensor<-1x1024xbf16>, <<NULL TYPE>>, gpu_tensor<-1xf32>
        (%69) = "weight_only_linear(phi_kernel)" (%66, %62, <<NULL VALUE>>, %61) {arch:90,group_size:-1,kernel_key:<backend:GPU|layout:NCHW|dtype:bfloat16>,kernel_name:"weight_only_linear",op_name:"pd_op.weight_only_linear",origin_id:6322,stop_gradient:[false],weight_dtype:"int8"} : (gpu_tensor<-1x1024xbf16>, gpu_tensor<2560x1024xi8>, <<NULL TYPE>>, gpu_tensor<2560xbf16>) -> gpu_tensor<-1x2560xbf16>
        (%70) = "shape64(phi_kernel)" (%69) {kernel_key:<backend:GPU|layout:NCHW|dtype:bfloat16>,kernel_name:"shape64",op_name:"pd_op.shape64",origin_id:6323,stop_gradient:[true]} : (gpu_tensor<-1x2560xbf16>) -> cpu_tensor<2xi64>
        (%71) = "full_int_array(phi_kernel)" () {dtype:int64,kernel_key:<backend:CPU|layout:Undefined(AnyLayout)|dtype:int64>,kernel_name:"full_int_array",op_name:"pd_op.full_int_array",origin_id:6324,place:Place(cpu),stop_gradient:[true],value:[0]} : () -> cpu_tensor<1xi64>
        (%72) = "full_int_array(phi_kernel)" () {dtype:int64,kernel_key:<backend:CPU|layout:Undefined(AnyLayout)|dtype:int64>,kernel_name:"full_int_array",op_name:"pd_op.full_int_array",origin_id:6325,place:Place(cpu),stop_gradient:[true],value:[1]} : () -> cpu_tensor<1xi64>
        (%73) = "slice(phi_kernel)" (%70, %71, %72) {axes:[0],decrease_axis:[0],infer_flags:[1],kernel_key:<backend:CPU|layout:NCHW|dtype:int64>,kernel_name:"slice",op_name:"pd_op.slice",origin_id:6326,stop_gradient:[true]} : (cpu_tensor<2xi64>, cpu_tensor<1xi64>, cpu_tensor<1xi64>) -> cpu_tensor<i64>
        (%74) = "full(phi_kernel)" () {dtype:int64,kernel_key:<backend:CPU|layout:Undefined(AnyLayout)|dtype:int64>,kernel_name:"full",op_name:"pd_op.full",origin_id:6327,place:Place(cpu),shape:[],stop_gradient:[true],value:2048} : () -> cpu_tensor<i64>
        (%75) = "builtin.combine" [id:6328] (%73, %74) {origin_id:5988,stop_gradient:[true]} : (cpu_tensor<i64>, cpu_tensor<i64>) -> vec[cpu_tensor<i64>,cpu_tensor<i64>]
        (%76) = "stack(phi_kernel)" (%75) {axis:0,kernel_key:<backend:CPU|layout:NCHW|dtype:int64>,kernel_name:"stack",op_name:"pd_op.stack",origin_id:6329,stop_gradient:[true]} : (vec[cpu_tensor<i64>,cpu_tensor<i64>]) -> cpu_tensor<2xi64>
        (%77) = "empty(phi_kernel)" (%76) {dtype:bfloat16,kernel_key:<backend:GPU|layout:Undefined(AnyLayout)|dtype:bfloat16>,kernel_name:"empty",op_name:"pd_op.empty",origin_id:6330,place:Place(undefined:0),stop_gradient:[true]} : (cpu_tensor<2xi64>) -> gpu_tensor<-1x2048xbf16>
        () = "cf.yield" [id:6331] (%65, %69, %71, %72, %74, %77) {origin_id:6222} : (gpu_tensor<-1x1024xbf16>, gpu_tensor<-1x2560xbf16>, cpu_tensor<1xi64>, cpu_tensor<1xi64>, cpu_tensor<i64>, gpu_tensor<-1x2048xbf16>) -> 
    }

@DrRyanHuang DrRyanHuang requested a review from SigureMo September 4, 2025 05:22
@paddle-bot

paddle-bot bot commented Sep 4, 2025

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

Member

@SigureMo SigureMo left a comment


Please add a unit test.

@codecov-commenter

codecov-commenter commented Sep 4, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@3011278). Learn more about missing BASE report.

Additional details and impacted files
@@             Coverage Diff             @@
##             develop    #75078   +/-   ##
===========================================
  Coverage           ?   100.00%           
===========================================
  Files              ?         1           
  Lines              ?         7           
  Branches           ?         0           
===========================================
  Hits               ?         7           
  Misses             ?         0           
  Partials           ?         0           

☔ View full report in Codecov by Sentry.

@SigureMo SigureMo changed the title Use pir::Place in CudaGraphOp output to avoid memcpy [Dy2St][CUDAGraph] Set undefined place for CUDAGraph OP outputs before lowering to avoid unnecessary memcpy Sep 8, 2025
@DrRyanHuang DrRyanHuang changed the title [Dy2St][CUDAGraph] Set undefined place for CUDAGraph OP outputs before lowering to avoid unnecessary memcpy [Dy2St][CUDAGraph] Set undefined place for CUDAGraph OP outputs before lowering to avoid unnecessary memcpy Sep 8, 2025
@SigureMo
Copy link
Member

SigureMo commented Sep 8, 2025

> The original graph-splitting logic causes the executor to insert unnecessary memcpy_h2d OPs at the end of the subgraph

The key point is not graph splitting but lowering; this PR does not modify the graph-splitting logic.

> The cause: we externally specified the output place as GPUPlace

That is the expected place; in the current case the expected place is GPU.
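The role of the expected place can be illustrated with a hypothetical predicate (again a sketch, not Paddle's actual executor code): a host-to-device transfer is only warranted when a value lives on the CPU but a *defined* expected place says it belongs on the GPU, which is exactly why an UNDEFINED place suppresses the transfer.

```cpp
// Simplified stand-in for phi::AllocationType.
enum class AllocationType { UNDEFINED = 0, CPU, GPU };

// Hypothetical decision rule: with no expectation (UNDEFINED) there is
// nothing to enforce, so no memcpy_h2d is inserted -- the trick this PR
// relies on before lowering.
bool NeedsMemcpyH2D(AllocationType actual, AllocationType expected) {
  if (expected == AllocationType::UNDEFINED) return false;
  return actual == AllocationType::CPU && expected == AllocationType::GPU;
}
```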

SigureMo
SigureMo previously approved these changes Sep 11, 2025
Member

@SigureMo SigureMo left a comment


LGTMeow 🐾

gouzil
gouzil previously approved these changes Sep 11, 2025
SigureMo
SigureMo previously approved these changes Sep 11, 2025
@@ -9,6 +9,8 @@ set(SOT_ENVS SOT_LOG_LEVEL=0 MIN_GRAPH_SIZE=0 STRICT_MODE=False
# swgu98: Temporarily commented on Windows platform
if(WIN32)
list(REMOVE_ITEM TEST_OPS test_for_enumerate)
Member


@swgu98 Please remember to fix this.

kolinwei
kolinwei previously approved these changes Sep 11, 2025
@DrRyanHuang DrRyanHuang dismissed stale reviews from kolinwei and SigureMo via 970c194 September 12, 2025 04:49
@DrRyanHuang DrRyanHuang merged commit 2feb9e4 into PaddlePaddle:develop Sep 15, 2025
73 of 76 checks passed
@DrRyanHuang DrRyanHuang deleted the cg branch September 15, 2025 02:39
co63oc pushed a commit to co63oc/Paddle that referenced this pull request Sep 18, 2025
@DrRyanHuang DrRyanHuang changed the title [Dy2St][CUDAGraph] Set undefined place for CUDAGraph OP outputs before lowering to avoid unnecessary memcpy [Dy2St][CUDAGraph] Set undefined place for CUDAGraph OP outputs before lowering to avoid unnecessary memcpy && Add CUDAGraph unitest Oct 20, 2025
6 participants