
[Dy2St][CUDAGraph] Set undefined place for CUDAGraph OP outputs before lowering to avoid unnecessary memcpy && Add CUDAGraph unitest#75078

Merged
DrRyanHuang merged 10 commits into PaddlePaddle:develop from cattidea:cg
Sep 15, 2025

Conversation

@DrRyanHuang
Contributor

@DrRyanHuang DrRyanHuang commented Sep 4, 2025

PR Category

Execute Infrastructure

PR Types

Improvements

Description

The original graph-splitting logic (this description was later revised; see #75078 (comment) below) caused the executor to insert unnecessary `memcpy_h2d` OPs at the end of the subgraph:

    (%55, %56, %57, %58, %59, %60) = "pd_op.cuda_graph" [id:3213] () {} : () -> gpu_tensor<-1x1024xbf16>, gpu_tensor<-1x2560xbf16>, gpu_tensor<1xi64>, gpu_tensor<1xi64>, gpu_tensor<i64>, gpu_tensor<-1x2048xbf16>
    {
		......
        (%65) = "embedding(phi_kernel)" (%0, %64) {kernel_key:<backend:GPU|layout:NCHW|dtype:bfloat16>,kernel_name:"embedding",op_name:"pd_op.embedding",origin_id:3218,padding_idx:-1,sparse:false,stop_gradient:[false]} : (gpu_tensor<-1xi64>, gpu_tensor<103424x1024xbf16>) -> gpu_tensor<-1x1024xbf16>
        (%66, %67, %68) = "rms_norm(phi_kernel)" (%65, <<NULL VALUE>>, <<NULL VALUE>>, %63, <<NULL VALUE>>) {begin_norm_axis:1,epsilon:1e-05,kernel_key:<backend:GPU|layout:NCHW|dtype:bfloat16>,kernel_name:"rms_norm",op_name:"pd_op.rms_norm",origin_id:3219,quant_max_bound:0,quant_min_bound:0,quant_round_type:0,quant_scale:-1,stop_gradient:[false,false,false]} : (gpu_tensor<-1x1024xbf16>, <<NULL TYPE>>, <<NULL TYPE>>, gpu_tensor<1024xbf16>, <<NULL TYPE>>) -> gpu_tensor<-1x1024xbf16>, <<NULL TYPE>>, gpu_tensor<-1xf32>
        (%69) = "weight_only_linear(phi_kernel)" (%66, %62, <<NULL VALUE>>, %61) {arch:90,group_size:-1,kernel_key:<backend:GPU|layout:NCHW|dtype:bfloat16>,kernel_name:"weight_only_linear",op_name:"pd_op.weight_only_linear",origin_id:3220,stop_gradient:[false],weight_dtype:"int8"} : (gpu_tensor<-1x1024xbf16>, gpu_tensor<2560x1024xi8>, <<NULL TYPE>>, gpu_tensor<2560xbf16>) -> gpu_tensor<-1x2560xbf16>
        (%70) = "shape64(phi_kernel)" (%69) {kernel_key:<backend:GPU|layout:NCHW|dtype:bfloat16>,kernel_name:"shape64",op_name:"pd_op.shape64",origin_id:3221,stop_gradient:[true]} : (gpu_tensor<-1x2560xbf16>) -> cpu_tensor<2xi64>
        (%71) = "full_int_array(phi_kernel)" () {dtype:int64,kernel_key:<backend:CPU|layout:Undefined(AnyLayout)|dtype:int64>,kernel_name:"full_int_array",op_name:"pd_op.full_int_array",origin_id:3222,place:Place(cpu),stop_gradient:[true],value:[0]} : () -> cpu_tensor<1xi64>
        (%72) = "full_int_array(phi_kernel)" () {dtype:int64,kernel_key:<backend:CPU|layout:Undefined(AnyLayout)|dtype:int64>,kernel_name:"full_int_array",op_name:"pd_op.full_int_array",origin_id:3223,place:Place(cpu),stop_gradient:[true],value:[1]} : () -> cpu_tensor<1xi64>
        (%73) = "slice(phi_kernel)" (%70, %71, %72) {axes:[0],decrease_axis:[0],infer_flags:[1],kernel_key:<backend:CPU|layout:NCHW|dtype:int64>,kernel_name:"slice",op_name:"pd_op.slice",origin_id:3224,stop_gradient:[true]} : (cpu_tensor<2xi64>, cpu_tensor<1xi64>, cpu_tensor<1xi64>) -> cpu_tensor<i64>
        (%74) = "full(phi_kernel)" () {dtype:int64,kernel_key:<backend:CPU|layout:Undefined(AnyLayout)|dtype:int64>,kernel_name:"full",op_name:"pd_op.full",origin_id:3225,place:Place(cpu),shape:[],stop_gradient:[true],value:2048} : () -> cpu_tensor<i64>
		.....
        (%78) = "memcpy_h2d(phi_kernel)" (%71) {dst_place_type:1,kernel_key:<backend:GPU|layout:Undefined(AnyLayout)|dtype:int64>,kernel_name:"memcpy_h2d",op_name:"pd_op.memcpy_h2d",origin_id:3229} : (cpu_tensor<1xi64>) -> gpu_tensor<1xi64>
        (%79) = "memcpy_h2d(phi_kernel)" (%72) {dst_place_type:1,kernel_key:<backend:GPU|layout:Undefined(AnyLayout)|dtype:int64>,kernel_name:"memcpy_h2d",op_name:"pd_op.memcpy_h2d",origin_id:3230} : (cpu_tensor<1xi64>) -> gpu_tensor<1xi64>
        (%80) = "memcpy_h2d(phi_kernel)" (%74) {dst_place_type:1,kernel_key:<backend:GPU|layout:Undefined(AnyLayout)|dtype:int64>,kernel_name:"memcpy_h2d",op_name:"pd_op.memcpy_h2d",origin_id:3231} : (cpu_tensor<i64>) -> gpu_tensor<i64>
        () = "cf.yield" [id:3232] (%65, %69, %78, %79, %80, %77) {origin_id:3120} : (gpu_tensor<-1x1024xbf16>, gpu_tensor<-1x2560xbf16>, gpu_tensor<1xi64>, gpu_tensor<1xi64>, gpu_tensor<i64>, gpu_tensor<-1x2048xbf16>) -> 
    }

The cause: we externally specified the output place as GPUPlace, so even though the final yield OP outputs include cpu_tensor values, the memcpy_h2d OPs inserted during lowering copy them onto the GPU.

Therefore, before ProcessBlock, we set the place to the default phi::Place, whose AllocationType is UNDEFINED:

class TEST_API Place {
 public:
  Place()
      : device(0), alloc_type_(AllocationType::UNDEFINED), device_type_id_(0) {}

With this, the executor no longer inserts memcpy_h2d OPs into the graph. After ProcessBlock, we simply set the output types of the CUDAGraphOp back:

  for (size_t i = 0; i < yield_op.num_operands(); ++i) {
    new_cg_op->result(i).set_type(yield_op.operand_type(i));
  }
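The overall flow of the fix can be sketched with a minimal mock. All names here (`MockOp`, `LowerCudaGraphOp`, the simplified `Place`/`AllocationType` types) are illustrative stand-ins for Paddle's internals, not its actual API:

```cpp
#include <vector>

// Simplified stand-ins for phi::AllocationType / phi::Place.
enum class AllocationType { UNDEFINED = 0, CPU, GPU };

struct Place {
  // A default-constructed Place is UNDEFINED, so the executor has no
  // "expected place" to enforce and inserts no memcpy_h2d.
  Place() : alloc_type_(AllocationType::UNDEFINED) {}
  explicit Place(AllocationType t) : alloc_type_(t) {}
  AllocationType alloc_type_;
};

// Hypothetical mock of an op whose results carry a place.
struct MockOp {
  std::vector<Place> result_places;
};

// Sketch of the PR's idea: clear the output places before lowering
// (ProcessBlock), then copy the yield operand places back afterwards.
void LowerCudaGraphOp(MockOp& cg_op, const MockOp& yield_op) {
  // 1. Before lowering: mark every output place UNDEFINED.
  for (auto& p : cg_op.result_places) p = Place();
  // 2. ... ProcessBlock would lower the body here; with UNDEFINED
  //    output places no memcpy_h2d OPs are inserted ...
  // 3. After lowering: restore the real places from the yield operands.
  cg_op.result_places = yield_op.result_places;
}
```

The restore step mirrors the `set_type(yield_op.operand_type(i))` loop above: the CUDAGraph OP's result types end up exactly matching the types yielded by its body.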

The final graph is shown below; the memcpy OPs are gone:

    (%55, %56, %57, %58, %59, %60) = "pd_op.cuda_graph" [id:6315] () {} : () -> gpu_tensor<-1x1024xbf16>, gpu_tensor<-1x2560xbf16>, cpu_tensor<1xi64>, cpu_tensor<1xi64>, cpu_tensor<i64>, gpu_tensor<-1x2048xbf16>
    {
		......
        (%65) = "embedding(phi_kernel)" (%0, %64) {kernel_key:<backend:GPU|layout:NCHW|dtype:bfloat16>,kernel_name:"embedding",op_name:"pd_op.embedding",origin_id:6320,padding_idx:-1,sparse:false,stop_gradient:[false]} : (gpu_tensor<-1xi64>, gpu_tensor<103424x1024xbf16>) -> gpu_tensor<-1x1024xbf16>
        (%66, %67, %68) = "rms_norm(phi_kernel)" (%65, <<NULL VALUE>>, <<NULL VALUE>>, %63, <<NULL VALUE>>) {begin_norm_axis:1,epsilon:1e-05,kernel_key:<backend:GPU|layout:NCHW|dtype:bfloat16>,kernel_name:"rms_norm",op_name:"pd_op.rms_norm",origin_id:6321,quant_max_bound:0,quant_min_bound:0,quant_round_type:0,quant_scale:-1,stop_gradient:[false,false,false]} : (gpu_tensor<-1x1024xbf16>, <<NULL TYPE>>, <<NULL TYPE>>, gpu_tensor<1024xbf16>, <<NULL TYPE>>) -> gpu_tensor<-1x1024xbf16>, <<NULL TYPE>>, gpu_tensor<-1xf32>
        (%69) = "weight_only_linear(phi_kernel)" (%66, %62, <<NULL VALUE>>, %61) {arch:90,group_size:-1,kernel_key:<backend:GPU|layout:NCHW|dtype:bfloat16>,kernel_name:"weight_only_linear",op_name:"pd_op.weight_only_linear",origin_id:6322,stop_gradient:[false],weight_dtype:"int8"} : (gpu_tensor<-1x1024xbf16>, gpu_tensor<2560x1024xi8>, <<NULL TYPE>>, gpu_tensor<2560xbf16>) -> gpu_tensor<-1x2560xbf16>
        (%70) = "shape64(phi_kernel)" (%69) {kernel_key:<backend:GPU|layout:NCHW|dtype:bfloat16>,kernel_name:"shape64",op_name:"pd_op.shape64",origin_id:6323,stop_gradient:[true]} : (gpu_tensor<-1x2560xbf16>) -> cpu_tensor<2xi64>
        (%71) = "full_int_array(phi_kernel)" () {dtype:int64,kernel_key:<backend:CPU|layout:Undefined(AnyLayout)|dtype:int64>,kernel_name:"full_int_array",op_name:"pd_op.full_int_array",origin_id:6324,place:Place(cpu),stop_gradient:[true],value:[0]} : () -> cpu_tensor<1xi64>
        (%72) = "full_int_array(phi_kernel)" () {dtype:int64,kernel_key:<backend:CPU|layout:Undefined(AnyLayout)|dtype:int64>,kernel_name:"full_int_array",op_name:"pd_op.full_int_array",origin_id:6325,place:Place(cpu),stop_gradient:[true],value:[1]} : () -> cpu_tensor<1xi64>
        (%73) = "slice(phi_kernel)" (%70, %71, %72) {axes:[0],decrease_axis:[0],infer_flags:[1],kernel_key:<backend:CPU|layout:NCHW|dtype:int64>,kernel_name:"slice",op_name:"pd_op.slice",origin_id:6326,stop_gradient:[true]} : (cpu_tensor<2xi64>, cpu_tensor<1xi64>, cpu_tensor<1xi64>) -> cpu_tensor<i64>
        (%74) = "full(phi_kernel)" () {dtype:int64,kernel_key:<backend:CPU|layout:Undefined(AnyLayout)|dtype:int64>,kernel_name:"full",op_name:"pd_op.full",origin_id:6327,place:Place(cpu),shape:[],stop_gradient:[true],value:2048} : () -> cpu_tensor<i64>
        (%75) = "builtin.combine" [id:6328] (%73, %74) {origin_id:5988,stop_gradient:[true]} : (cpu_tensor<i64>, cpu_tensor<i64>) -> vec[cpu_tensor<i64>,cpu_tensor<i64>]
        (%76) = "stack(phi_kernel)" (%75) {axis:0,kernel_key:<backend:CPU|layout:NCHW|dtype:int64>,kernel_name:"stack",op_name:"pd_op.stack",origin_id:6329,stop_gradient:[true]} : (vec[cpu_tensor<i64>,cpu_tensor<i64>]) -> cpu_tensor<2xi64>
        (%77) = "empty(phi_kernel)" (%76) {dtype:bfloat16,kernel_key:<backend:GPU|layout:Undefined(AnyLayout)|dtype:bfloat16>,kernel_name:"empty",op_name:"pd_op.empty",origin_id:6330,place:Place(undefined:0),stop_gradient:[true]} : (cpu_tensor<2xi64>) -> gpu_tensor<-1x2048xbf16>
        () = "cf.yield" [id:6331] (%65, %69, %71, %72, %74, %77) {origin_id:6222} : (gpu_tensor<-1x1024xbf16>, gpu_tensor<-1x2560xbf16>, cpu_tensor<1xi64>, cpu_tensor<1xi64>, cpu_tensor<i64>, gpu_tensor<-1x2048xbf16>) -> 
    }

@DrRyanHuang DrRyanHuang requested a review from SigureMo September 4, 2025 05:22
@paddle-bot

paddle-bot bot commented Sep 4, 2025

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

Member

@SigureMo SigureMo left a comment


Please add a unit test.

@codecov-commenter

codecov-commenter commented Sep 4, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@3011278). Learn more about missing BASE report.

Additional details and impacted files
@@             Coverage Diff             @@
##             develop    #75078   +/-   ##
===========================================
  Coverage           ?   100.00%           
===========================================
  Files              ?         1           
  Lines              ?         7           
  Branches           ?         0           
===========================================
  Hits               ?         7           
  Misses             ?         0           
  Partials           ?         0           

☔ View full report in Codecov by Sentry.

@SigureMo SigureMo changed the title Use pir::Place in CudaGraphOp output to avoid memcpy [Dy2St][CUDAGraph] Set undefined place for CUDAGraph OP outputs before lowering to avoid unnecessary memcpy Sep 8, 2025
@DrRyanHuang DrRyanHuang changed the title [Dy2St][CUDAGraph] Set undefined place for CUDAGraph OP outputs before lowering to avoid unnecessary memcpy [Dy2St][CUDAGraph] Set undefined place for CUDAGraph OP outputs before lowering to avoid unnecessary memcpy Sep 8, 2025
@SigureMo
Copy link
Member

SigureMo commented Sep 8, 2025

> The original graph-splitting logic causes the executor to insert unnecessary memcpy_h2d OPs at the end of the subgraph

The key point is not graph splitting but lowering; this PR does not modify the graph-splitting logic.

> The cause: we externally specified the output place as GPUPlace

That is the expected place; in the current case the expected place is GPU.
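The role of the expected place can be illustrated with a hypothetical predicate (again a sketch, not Paddle's actual executor code): a host-to-device transfer is only warranted when a value lives on the CPU but a *defined* expected place says it belongs on the GPU, which is exactly why an UNDEFINED place suppresses the transfer.

```cpp
// Simplified stand-in for phi::AllocationType.
enum class AllocationType { UNDEFINED = 0, CPU, GPU };

// Hypothetical decision rule: with no expectation (UNDEFINED) there is
// nothing to enforce, so no memcpy_h2d is inserted -- the trick this PR
// relies on before lowering.
bool NeedsMemcpyH2D(AllocationType actual, AllocationType expected) {
  if (expected == AllocationType::UNDEFINED) return false;
  return actual == AllocationType::CPU && expected == AllocationType::GPU;
}
```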

SigureMo
SigureMo previously approved these changes Sep 11, 2025
Member

@SigureMo SigureMo left a comment


LGTMeow 🐾

gouzil
gouzil previously approved these changes Sep 11, 2025
SigureMo
SigureMo previously approved these changes Sep 11, 2025
@@ -9,6 +9,8 @@ set(SOT_ENVS SOT_LOG_LEVEL=0 MIN_GRAPH_SIZE=0 STRICT_MODE=False
# swgu98: Temporarily commented on Windows platform
if(WIN32)
list(REMOVE_ITEM TEST_OPS test_for_enumerate)
Member


@swgu98 Please remember to fix this.

kolinwei
kolinwei previously approved these changes Sep 11, 2025
@DrRyanHuang DrRyanHuang dismissed stale reviews from kolinwei and SigureMo via 970c194 September 12, 2025 04:49
@DrRyanHuang DrRyanHuang merged commit 2feb9e4 into PaddlePaddle:develop Sep 15, 2025
73 of 76 checks passed
@DrRyanHuang DrRyanHuang deleted the cg branch September 15, 2025 02:39
co63oc pushed a commit to co63oc/Paddle that referenced this pull request Sep 18, 2025
@DrRyanHuang DrRyanHuang changed the title [Dy2St][CUDAGraph] Set undefined place for CUDAGraph OP outputs before lowering to avoid unnecessary memcpy [Dy2St][CUDAGraph] Set undefined place for CUDAGraph OP outputs before lowering to avoid unnecessary memcpy && Add CUDAGraph unitest Oct 20, 2025
6 participants