[PHI] Two Stage Scatter/Gather Kernel for Fully Synced Results #74967
PR Category
Operator Mechanism
PR Types
Improvements
Description
The `paddle.put_along_axis` family of APIs produces incorrect results in `include_self=False` mode once the data size grows moderately large, as reported in #72803. This PR makes three main changes:

1. For `reduce != assign` modes, the forward pass is converted into a two-stage computation (three stages for `mean`): (1) if `include_self=False`, the target positions are first assigned the init value of the corresponding reduce op (e.g. the matching `numeric_limit` for `max`, 0 for `mean`); (2) the reduce op is computed, with no extra atomic operations; (3) for `mean`, a cast-div is performed (full alignment with PyTorch, i.e. changing the integer rounding rule, will be added later inside the corresponding kernel). These changes further cut the memory overhead of the `reduce != assign` case.
2. A `ScatterScalarKernel` is added for the case where the input `values` is a single scalar, so that we no longer materialize a view via Python-side `broadcast_to`. Besides assigning init values in the `reduce != assign`, `include_self=False` case, this kernel also serves as a fast path for scalars. Further optimization of this fast path is planned to be handed to community development.
3. When the tensor is large and there are many blocks, operations on `aux_buffer` cannot be synchronized with `__syncthreads()` alone (intra-grid synchronization cannot be achieved with intra-block synchronization). The current approach, similar to #74854, splits the computation into two kernels. Rationale: the remaining work here is fairly repetitive and sizeable, so it is planned as a follow-up community task.
Complete forward and backward test scripts are available in this gist: Enigmatisms/scatter_gather_forward_backward.py
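The two-kernel split mentioned above can be sketched in host-style pseudocode: each Python function below stands in for one kernel launch, and the gap between the two calls models the implicit grid-wide barrier that a kernel-launch boundary provides (which `__syncthreads()`, a block-level barrier, cannot). All names here are hypothetical, not the PR's actual kernels:

```python
def scatter_partials_kernel(values, indices, aux_buffer, num_blocks):
    """'Kernel 1': each simulated block reduces its slice into a private row
    of aux_buffer, so blocks never race on the same memory."""
    chunk = (len(values) + num_blocks - 1) // num_blocks
    for b in range(num_blocks):  # one loop iteration = one block
        for k in range(b * chunk, min((b + 1) * chunk, len(values))):
            aux_buffer[b][indices[k]] += values[k]

def gather_combine_kernel(aux_buffer, out):
    """'Kernel 2': combine per-block partials; safe to read aux_buffer only
    because kernel 1 has fully finished (the launch boundary is the sync)."""
    for partial in aux_buffer:
        for i, v in enumerate(partial):
            out[i] += v

values = [1.0] * 10
indices = [i % 4 for i in range(10)]
num_blocks = 3
aux = [[0.0] * 4 for _ in range(num_blocks)]
out = [0.0] * 4
scatter_partials_kernel(values, indices, aux, num_blocks)
# <-- on a real device, the grid-wide sync happens here, between launches
gather_combine_kernel(aux, out)
```

A single fused kernel would need a grid-wide barrier between the two phases, which plain CUDA does not provide without cooperative launch; splitting the work across two launches sidesteps that requirement.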
Pcard-89620