Description
We should keep up with the latest CUDA optimizations in upstream PyTorch (pytorch/pytorch):
- aten::gather: faster gather implementation pytorch/pytorch#151490
  - Update scatter gather and index implementation #1643
- aten::gather: Support more dtypes for input, indices in gather pytorch/pytorch#151822
  - Update scatter gather and index implementation #1643
- Loops kernels: [ATen][CUDA] Implement 128 bit vectorization v2 pytorch/pytorch#145746
- h2d: Tensor .cuda() very slow with specific array sizes pytorch/pytorch#153176
- Add `_foreach_fill_` ops pytorch/pytorch#150092
- cat: [aten] 8 bytes aligned vector loads for bf16 and fp16 dtypes in torch.cat pytorch/pytorch#150233