cattidea
diff --git a/.agents/skills/paddle-design-compat/SKILL.md b/.agents/skills/paddle-design-compat/SKILL.md
diff --git a/.agents/skills/paddle-design-compat/references/cpp-compat-layer.md b/.agents/skills/paddle-design-compat/references/cpp-compat-layer.md
@@ -0,0 +1,204 @@
+---
+name: paddle-design-compat
+description: "Use when working with Paddle's cross-ecosystem compatibility system: the PyTorch C++ API compat layer (at::Tensor, torch::*, c10::*), Python torch proxy (paddle.enable_compat), TORCH_LIBRARY operator registration, cpp_extension build system, Triton/TileLang kernel DSL integration, custom C++ operator authoring (paddle/extension.h, PD_BUILD_OP), custom Python operator (PyLayer), C++ extension (pybind11), or migrating PyTorch custom operators / third-party libraries (FlashInfer, FlashMLA, DeepGEMM etc.) to Paddle."
+---
+
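+# Paddle API Compatibility System
+
+Paddle 3.0 provides a bottom-up, cross-ecosystem compatibility mechanism that lets custom operator libraries and kernel DSLs from the PyTorch ecosystem run on Paddle with minimal changes. Paddle's native custom operator and extension mechanisms remain available alongside it.
+
+## Compatibility System Overview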
+
+```
+User code (Python)
+  │
+  ▼  Layer 4: Python API proxy layer
+  paddle.enable_compat(scope={"flashinfer"})
+  → import torch is redirected to paddle (intercepted via sys.meta_path)
+  │
+  ▼  Layer 3: Python interface compatibility layer
+  paddle.compat module: compatible with PyTorch Python model-building APIs (torch.ops, torch.nn, etc.)
+  │
+  ▼  Layer 2: Operator registration compatibility layer
+  TORCH_LIBRARY / TORCH_LIBRARY_IMPL macros → registered into Paddle operator dispatch
+  pybind11 registration → no changes needed
+  │
+  ▼  Layer 1: C++ API compatibility layer
+  at::Tensor / torch::* / c10::* namespaces
+  → proxied to paddle::Tensor / phi API
+  │
+  ▼  Paddle native operators / PHI kernels
+```
+
+## Layer 1: C++ API Compatibility Layer
+
+Located in `paddle/phi/api/include/compat/` and organized to mirror the PyTorch header layout:
+
+| Directory | PyTorch namespace | Contents |
+|------|----------------------|------|
+| `ATen/` | `at::` | Tensor, TensorBase, ops (empty / cat / reshape …), CUDA context |
+| `c10/` | `c10::` | ScalarType, Device, TensorOptions, Stream, CUDAGuard |
+| `torch/` | `torch::` | Library (TORCH_LIBRARY macros), nn::functional |
+| `utils/` | (none) | Type conversion utilities (IntArrayRef ↔ IntArray, ScalarType ↔ DataType) |
+
+Core design: `at::Tensor` is a proxy class that holds a `paddle::Tensor` (see `ATen/core/TensorBody.h`), and all of its methods delegate to `paddle::Tensor`. Type equality is determined through the inner paddle object.
+
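+## Layer 2: Operator Registration Compatibility Layer
+
+`torch/library.h` provides registration macros with the same names as PyTorch's:
+
+| Macro | Purpose |
+|----|------|
+| `TORCH_LIBRARY(ns, m)` | Defines operator schemas (e.g. `m.def("muladd(Tensor a, Tensor b, float c) -> Tensor")`) |
+| `TORCH_LIBRARY_IMPL(ns, k, m)` | Registers the implementation for a dispatch key (e.g. CPU, CUDA) |
+| `TORCH_LIBRARY_FRAGMENT(ns, m)` | Appends definitions across translation units |
+
+Once registered, an operator is called via `torch.ops.<ns>.<op_name>`, and the proxy layer routes the call to Paddle's operator dispatch.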
+
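+## Layer 3: Python Interface Compatibility Layer
+
+The `python/paddle/compat/` module provides compatible implementations of PyTorch Python APIs:
+
+- `__init__.py`: bridges function-signature differences for `torch.sort`, `torch.split`, `torch.unique`, etc.
+- `nn/`: compatible with the `torch.nn` module interface
+- `proxy.py`: core torch proxy implementation
+
+## Layer 4: Python API Proxy Layer (Torch Proxy)
+
+`paddle.enable_compat()` injects a `TorchProxyMetaFinder` into `sys.meta_path`, which intercepts `import torch` and redirects it to `paddle.compat`: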
+
+```python
+import paddle
+
+# Enable globally
+paddle.enable_compat()
+
+# Restrict to a scope (recommended)
+paddle.enable_compat(scope={"flashinfer", "flash_attn"})
+
+# Context manager
+with paddle.use_compat_guard():
+    import some_torch_lib
+```
+
+When a scope is given, the redirect applies only to `import torch` statements inside the listed module namespaces, so other code is not affected.
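The interception idea can be illustrated with a minimal, standard-library-only sketch. It is not Paddle's `TorchProxyMetaFinder`: the class name `AliasFinder`, the alias `fake_torch`, and the choice of `json` as the redirect target are all hypothetical, and scope filtering is left out. It only shows how a finder placed on `sys.meta_path` can make one module name resolve to another module.

```python
import importlib
import importlib.abc
import importlib.util
import sys


class AliasFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Hypothetical finder/loader: serves `alias` as a proxy for `target`."""

    def __init__(self, alias, target):
        self.alias = alias
        self.target = target

    def find_spec(self, fullname, path=None, target=None):
        if fullname != self.alias:
            return None  # defer to the regular import machinery
        return importlib.util.spec_from_loader(fullname, self)

    def create_module(self, spec):
        return None  # use the default (empty) module object

    def exec_module(self, module):
        real = importlib.import_module(self.target)
        # PEP 562 module-level __getattr__: forward attribute lookups
        # that the proxy module cannot satisfy to the real module.
        module.__getattr__ = lambda name: getattr(real, name)


# Install the finder ahead of the default finders, then import the alias.
sys.meta_path.insert(0, AliasFinder(alias="fake_torch", target="json"))

import fake_torch  # resolved by AliasFinder, backed by the json module

encoded = fake_torch.dumps({"ok": True})
print(encoded)
```

Forwarding through a proxy module, rather than returning the target module object itself, keeps the target's own `__spec__` and `__name__` untouched; whether Paddle's proxy makes the same choice is not stated in the document.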
+
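+## Paddle Native Custom Operator Mechanisms
+
+Paddle also provides three extension mechanisms of its own that do not depend on the compatibility layer:
+
+### Custom C++ Operators
+
+Registered via `paddle/extension.h` and the `PD_BUILD_OP` macro: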
+
+```cpp
+#include "paddle/extension.h"
+
+std::vector<paddle::Tensor> ReluForward(const paddle::Tensor& x) {
+    return {paddle::relu(x)};
+}
+
+PD_BUILD_OP(custom_relu)
+    .Inputs({"X"})
+    .Outputs({"Out"})
+    .SetKernelFn(PD_KERNEL(ReluForward));
+```
+
+Build options:
+- **setuptools**: `paddle.utils.cpp_extension.CppExtension` / `CUDAExtension`
+- **JIT**: `paddle.utils.cpp_extension.load()`
+
+### Custom Python Operators (PyLayer)
+
+```python
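+import paddle
+from paddle.autograd import PyLayer
+
+class CustomOp(PyLayer):
+    @staticmethod
+    def forward(ctx, x):
+        # Save the input so that backward can recompute exp(x)
+        ctx.save_for_backward(x)
+        return x.exp()
+
+    @staticmethod
+    def backward(ctx, grad):
+        x, = ctx.saved_tensor()
+        # d/dx exp(x) = exp(x)
+        return grad * x.exp()
+
+input_tensor = paddle.randn([2, 3])
+input_tensor.stop_gradient = False  # track gradients for this input
+out = CustomOp.apply(input_tensor)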
+```
+
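+### C++ Extensions (pybind11)
+
+Bind C++ functions directly to Python via `PYBIND11_MODULE`. This suits non-operator use cases (data processing, utility functions, etc.).
+
+## Kernel DSL Ecosystem
+
+### Triton
+
+The official Triton package is supported directly and works together with the torch proxy: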
+
+```python
+import paddle
+
+paddle.enable_compat(scope={"triton"})
+import triton
+import triton.language as tl
+```
+
+### TileLang
+
+Requires the adapted package `tilelang-paddle`, which is likewise used through the torch proxy.
+
+## Supported Cross-Ecosystem Operator Libraries
+
+| Library | GitHub | Description |
+|--------|--------|------|
+| FlashInfer | [PFCCLab/flashinfer](https://github.com/PFCCLab/flashinfer) | Attention operators |
+| FlashMLA | [PFCCLab/FlashMLA](https://github.com/PFCCLab/FlashMLA) | Multi-head Latent Attention |
+| DeepGEMM | [PFCCLab/DeepGEMM](https://github.com/PFCCLab/DeepGEMM) | FP8 GEMM |
+| DeepEP | [PFCCLab/DeepEP](https://github.com/PFCCLab/DeepEP) | Expert Parallelism communication |
+| SonicMoE | [PFCCLab/sonic-moe](https://github.com/PFCCLab/sonic-moe) | MoE acceleration |
+| PaddleCodec | [PFCCLab/paddlecodec](https://github.com/PFCCLab/paddlecodec) | Video encoding and decoding |
+
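+## Typical Steps for Migrating PyTorch Custom Operators
+
+1. **Adjust the build script**: add `import paddle; paddle.enable_compat()` at the top of `setup.py` and keep the original `torch.utils.cpp_extension` calls
+2. **Build**: `pip install . --no-build-isolation` (Paddle's `cpp_extension` stands in for the PyTorch version during compilation)
+3. **Fix C++ compile errors**: for APIs the compatibility layer does not cover, bridge to the Paddle C++ API through the `_PD_*` conversion functions
+4. **Python side**: call `paddle.enable_compat(scope={"your_lib"})` in the test script, then use the library directly
+
+## Which File to Read for Which Scenario
+
+| Scenario | Reference |
+|------|----------|
+| C++ compatibility layer header structure and proxy implementation | [references/cpp-compat-layer.md](references/cpp-compat-layer.md) |
+| Python torch proxy and compatible APIs | [references/python-compat-layer.md](references/python-compat-layer.md) |
+| Paddle native custom operators (C++ / Python / pybind11) | [references/native-custom-op.md](references/native-custom-op.md) |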
+
+## Source Entry Points
+
+### C++ Compatibility Layer
+
+| Module | Path |
+|------|------|
+| Compatibility layer root | `paddle/phi/api/include/compat/` |
+| at::Tensor proxy class | `paddle/phi/api/include/compat/ATen/core/TensorBody.h` |
+| c10 type system | `paddle/phi/api/include/compat/c10/core/` |
+| TORCH_LIBRARY registration macros | `paddle/phi/api/include/compat/torch/library.h` |
+| Type conversion utilities | `paddle/phi/api/include/compat/utils/` |
+| torch.ops dispatch implementation | `paddle/phi/api/include/compat/torch/library.cpp` |
+
+### Python Compatibility Layer
+
+| Module | Path |
+|------|------|
+| paddle.enable_compat entry point | `python/paddle/__init__.py` (alias `enable_torch_proxy`) |
+| Torch Proxy core | `python/paddle/compat/proxy.py` |
+| Python API compatibility implementations | `python/paddle/compat/__init__.py` |
+| nn compatibility module | `python/paddle/compat/nn/` |
+
+### Paddle Native Custom Operators
+
+| Module | Path |
+|------|------|
+| extension.h umbrella header | `paddle/extension.h` |
+| op_meta_info (PD_BUILD_OP) | `paddle/phi/api/ext/op_meta_info.h` |
+| cpp_extension build tools | `python/paddle/utils/cpp_extension/cpp_extension.py` |
+| extension_utils | `python/paddle/utils/cpp_extension/extension_utils.py` |
+| PyLayer | `python/paddle/autograd/py_layer.py` |
@@ -0,0 +1,172 @@
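+# C++ API Compatibility Layer
+
+## Overview
+
+The C++ compatibility layer lives in `paddle/phi/api/include/compat/` and mirrors PyTorch's header directory structure, so that existing PyTorch custom operator C++ code can compile without modification.
+
+Core idea: provide same-named classes and functions for PyTorch's namespaces (`at::`, `torch::`, `c10::`) that internally delegate to the corresponding Paddle C++ APIs.
+
+## Directory Structure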
+
+```
+paddle/phi/api/include/compat/
+├── ATen/
+│   ├── core/
+│   │   ├── TensorBody.h      ← core of the at::Tensor proxy class
+│   │   ├── TensorBase.h      ← TensorBase base class, holds a paddle::Tensor
+│   │   ├── Scalar.h          ← at::Scalar → paddle::Scalar
+│   │   ├── TensorAccessor.h  ← data accessor
+│   │   ├── Generator.h       ← random number generator compatibility
+│   │   └── ivalue.h          ← IValue container (used for TORCH_LIBRARY schema parsing)
+│   ├── cuda/
+│   │   ├── CUDAContext.h      ← at::cuda::getCurrentCUDAStream, etc.
+│   │   ├── CUDAStream.h       ← CUDA stream wrapper
+│   │   ├── CUDAEvent.h        ← CUDA event wrapper
+│   │   ├── CUDAGuard.h        ← device-switching guard
+│   │   ├── CUDADataType.h     ← CUDA data type mapping
+│   │   ├── EmptyTensor.h/cpp  ← empty tensor creation on CUDA
+│   │   └── PhiloxCudaState.h  ← Philox RNG state
+│   ├── ops/
+│   │   ├── empty.h            ← at::empty()
+│   │   ├── full.h             ← at::full()
+│   │   ├── ones.h / zeros.h   ← at::ones() / at::zeros()
+│   │   ├── cat.h              ← at::cat()
+│   │   ├── reshape.h          ← at::reshape()
+│   │   ├── slice.h / select.h ← at::slice() / at::select()
+│   │   ├── to.h               ← at::Tensor::to()
+│   │   └── ... (60+ ops)
+│   ├── native/
+│   │   └── cuda/Resize.h
+│   ├── ATen.h                 ← top-level aggregate header
+│   ├── Device.h
+│   ├── DeviceGuard.h
+│   ├── Functions.h
+│   ├── Tensor.h
+│   ├── TensorIndexing.h
+│   └── Utils.h/cpp
+├── c10/
+│   ├── core/
+│   │   ├── ScalarType.h       ← c10::ScalarType enum (maps to phi::DataType)
+│   │   ├── Device.h/cpp       ← c10::Device (maps to phi::Place)
+│   │   ├── TensorOptions.h    ← c10::TensorOptions (dtype + device + layout)
+│   │   ├── Stream.h/cpp       ← c10::Stream
+│   │   ├── Storage.h          ← c10::Storage
+│   │   ├── Layout.h           ← c10::Layout
+│   │   ├── MemoryFormat.h
+│   │   ├── DispatchKey.h      ← dispatch key enum (CPU / CUDA, etc.)
+│   │   ├── Scalar.h           ← c10::Scalar
+│   │   └── Allocator.h
+│   ├── cuda/
+│   │   ├── CUDAStream.h       ← c10::cuda::CUDAStream
+│   │   ├── CUDAGuard.h        ← c10::cuda::CUDAStreamGuard
+│   │   ├── CUDAFunctions.h    ← c10::cuda::device_count(), etc.
+│   │   └── CUDAException.h
+│   ├── macros/Macros.h        ← C10_CONCATENATE and other macros
+│   └── util/accumulate.h
+├── torch/
+│   ├── library.h/cpp          ← TORCH_LIBRARY / TORCH_LIBRARY_IMPL macros
+│   ├── extension.h            ← torch/extension.h entry point
+│   └── csrc/api/include/torch/
+│       ├── all.h
+│       ├── cuda.h/cpp
+│       ├── python.h
+│       ├── types.h
+│       ├── sparse.h
+│       └── nn/functional.h
+├── utils/
+│   ├── scalar_type_conversion.h   ← ScalarType ↔ phi::DataType
+│   ├── int_array_ref_conversion.h ← IntArrayRef ↔ IntArray
+│   ├── dense_sparse_conversion.h  ← sparse tensor conversion
+│   ├── pinned_place.h
+│   └── macros.h
+├── CMakeLists.txt
+└── README.md
+```
+
+## The at::Tensor Proxy Class
+
+The core design is in `ATen/core/TensorBody.h`:
+
+```cpp
+namespace at {
+using PaddleTensor = paddle::Tensor;
+
+class Tensor : public TensorBase {
+ public:
+  // TensorBase internally holds a paddle::Tensor named tensor_
+  Tensor(const PaddleTensor& tensor) : TensorBase(tensor) {}
+
+  // Methods delegate to paddle::Tensor
+  void* data_ptr() const { return const_cast<void*>(tensor_.data()); }
+  c10::IntArrayRef sizes() const { ... }  // dims() → IntArrayRef
+  int64_t numel() const { return tensor_.numel(); }
+  c10::ScalarType dtype() const { ... }   // phi::DataType → ScalarType
+  c10::Device device() const { ... }      // Place → Device
+
+  // Internal bridging interface
+  PaddleTensor _PD_GetInner() const { return tensor_; }
+};
+}  // namespace at
+```
+
+Every method on `at::Tensor` ultimately calls the C++ API of `paddle::Tensor` or `paddle::experimental::*`.
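The delegation pattern in the excerpt above can be shown in a few lines of Python. This is only an analogy: `InnerTensor` and `ProxyTensor` are hypothetical stand-ins for `paddle::Tensor` and `at::Tensor`, and none of this reflects the real C++ signatures beyond what the excerpt shows.

```python
from math import prod


class InnerTensor:
    """Hypothetical stand-in for paddle::Tensor."""

    def __init__(self, dims):
        self._dims = list(dims)

    def dims(self):
        return list(self._dims)

    def numel(self):
        return prod(self._dims)


class ProxyTensor:
    """Hypothetical stand-in for at::Tensor: holds an inner tensor and forwards to it."""

    def __init__(self, inner):
        self._inner = inner

    def sizes(self):
        # Adapt the inner representation (a list) to what the foreign API returns (a tuple).
        return tuple(self._inner.dims())

    def numel(self):
        # Pure pass-through: no state is duplicated in the proxy.
        return self._inner.numel()

    def _PD_GetInner(self):
        # Explicit unwrapping hook, named after the bridging accessor in the excerpt.
        return self._inner


inner = InnerTensor([2, 3, 4])
proxy = ProxyTensor(inner)
print(proxy.sizes(), proxy.numel())
```

Because the proxy stores no tensor state of its own, the two views cannot drift apart, and code that needs the native object can unwrap it explicitly.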
+
+## Type Conversion Utilities
+
+The `utils/` directory provides bidirectional conversion functions:
+
+| Function | Conversion |
+|------|----------|
+| `_PD_AtenScalarTypeToPhiDataType()` | `c10::ScalarType` → `phi::DataType` |
+| `_PD_PhiDataTypeToAtenScalarType()` | `phi::DataType` → `c10::ScalarType` |
+| `_PD_PhiDDimToIntArrayRef()` | `phi::DDim` → `c10::IntArrayRef` |
+| `IntArrayRef::_PD_ToPaddleIntArray()` | `c10::IntArrayRef` → `paddle::IntArray` |
+| `TensorOptions::_PD_GetPlace()` | `c10::TensorOptions` → `phi::Place` |
+
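+Naming convention: all conversion functions added as Paddle extensions carry the `_PD_` prefix.
+
+## TORCH_LIBRARY Operator Registration
+
+`torch/library.h` implements PyTorch's operator registration macros: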
+
+```cpp
+// Define the operator schema
+TORCH_LIBRARY(my_ops, m) {
+    m.def("muladd(Tensor a, Tensor b, float c) -> Tensor");
+}
+
+// Register the CPU implementation (muladd_cpu is assumed to be defined elsewhere)
+TORCH_LIBRARY_IMPL(my_ops, CPU, m) {
+    m.impl("muladd", &muladd_cpu);
+}
+```
+
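+How it works:
+1. The `TORCH_LIBRARY` macro expands to a static initializer that constructs a `torch::Library` object
+2. `Library::def()` parses the function-signature string and records the schema
+3. `Library::impl()` registers the function pointer in the dispatch table
+4. On the Python side, `torch.ops.<ns>.<name>` looks up and invokes the registered implementation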
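The four steps above can be modeled as a small Python sketch: record a schema, register an implementation under a dispatch key, then look it up at call time. This toy `Library` class and the `call_op` helper are hypothetical and are not the code in `torch/library.cpp`. The method is named `define` because `def` is a Python keyword, and the schema "parsing" here only extracts the operator name.

```python
from collections import defaultdict


class Library:
    """Toy stand-in for torch::Library: records schemas and per-key implementations."""

    _schemas = {}               # "ns::name" -> schema string
    _impls = defaultdict(dict)  # "ns::name" -> {dispatch_key: callable}

    def __init__(self, ns, dispatch_key=None):
        self.ns = ns
        self.dispatch_key = dispatch_key

    def define(self, schema):
        # Step 2: take the operator name from the signature string and record the schema.
        name = schema.split("(", 1)[0].strip()
        Library._schemas[f"{self.ns}::{name}"] = schema

    def impl(self, name, fn):
        # Step 3: register the callable in the dispatch table under this library's key.
        qualified = f"{self.ns}::{name}"
        if qualified not in Library._schemas:
            raise KeyError(f"no schema defined for {qualified}")
        Library._impls[qualified][self.dispatch_key] = fn


def call_op(ns, name, dispatch_key, *args):
    # Step 4: look up the registered implementation and invoke it.
    return Library._impls[f"{ns}::{name}"][dispatch_key](*args)


# Rough counterparts of TORCH_LIBRARY(my_ops, m) and TORCH_LIBRARY_IMPL(my_ops, CPU, m).
Library("my_ops").define("muladd(Tensor a, Tensor b, float c) -> Tensor")
Library("my_ops", "CPU").impl("muladd", lambda a, b, c: a * b + c)

result = call_op("my_ops", "muladd", "CPU", 2.0, 3.0, 1.0)
print(result)
```

Keeping schemas and implementations in separate tables is what lets a definition and its per-backend implementations live in different compilation units, which is the role the document assigns to `TORCH_LIBRARY_IMPL` and `TORCH_LIBRARY_FRAGMENT`.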
+
+## Common Build Problems
+
+| Error | Cause | Fix |
+|------|------|----------|
+| `'empty' is not a member of 'torch'` | The compatibility layer does not implement this op | Use `_PD_*` conversions together with `paddle::experimental::empty()` |
+| `'ScalarType' has no member 'Half'` | Enum value names differ | Check the mapping in `c10/core/ScalarType.h` |
+| `cannot convert 'at::Tensor' to 'paddle::Tensor'` | Explicit unwrapping is required | Use `tensor._PD_GetInner()` |
+| `undefined reference to 'at::*'` | The `.cpp` files are not linked | Make sure the CMakeLists includes the compat directory |
+
+## Key Source Paths
+
+| File | Description |
+|------|------|
+| `paddle/phi/api/include/compat/ATen/core/TensorBody.h` | at::Tensor proxy class definition |
+| `paddle/phi/api/include/compat/ATen/core/TensorBase.h` | Base class, holds paddle::Tensor |
+| `paddle/phi/api/include/compat/ATen/ops/*.h` | Compatibility implementations of the individual at::ops |
+| `paddle/phi/api/include/compat/c10/core/ScalarType.h` | Data type enum mapping |
+| `paddle/phi/api/include/compat/c10/core/Device.h` | Device type mapping |
+| `paddle/phi/api/include/compat/c10/core/TensorOptions.h` | TensorOptions compatibility |
+| `paddle/phi/api/include/compat/torch/library.h` | TORCH_LIBRARY macro implementation |
+| `paddle/phi/api/include/compat/torch/library.cpp` | Library class method implementations |
+| `paddle/phi/api/include/compat/utils/scalar_type_conversion.h` | Bidirectional ScalarType conversion |
+| `paddle/phi/api/include/compat/utils/int_array_ref_conversion.h` | Bidirectional IntArrayRef conversion |