yan (炎) is a high-performance CUDA operator library built for learning, with an emphasis on clean code and maximum performance. Built on DeepGEMM's JIT framework and NVIDIA's CuTe, yan delivers efficient operators tuned primarily for RTX 4090 GPUs.
Diverse Operator Suite: implementations of multiple high-performance operators, including:
- Reduction
- Scan (prefix sum)
- General Matrix Multiplication (GEMM)
- Online Softmax
- Flash Attention
- Custom Triplane Sampling Operator
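Of the operators above, online softmax is perhaps the least self-explanatory: instead of the usual two passes (find the max, then sum the exponentials), it maintains a running maximum and a rescaled running sum in a single pass. Below is a minimal CPU reference sketch of that recurrence; `online_softmax` is an illustrative name, not yan's actual API.

```cpp
#include <cmath>
#include <vector>

// Single-pass "online softmax": track the running max m and the running
// sum s of exp(x_i - m), rescaling s by exp(m_old - m_new) whenever the
// max grows. Illustrative CPU sketch, not yan's GPU implementation.
std::vector<float> online_softmax(const std::vector<float>& x) {
    float m = -INFINITY;  // running maximum
    float s = 0.0f;       // running sum of exp(x_i - m)
    for (float v : x) {
        float m_new = std::fmax(m, v);
        s = s * std::exp(m - m_new) + std::exp(v - m_new);
        m = m_new;
    }
    std::vector<float> out(x.size());
    for (size_t i = 0; i < x.size(); ++i)
        out[i] = std::exp(x[i] - m) / s;
    return out;
}
```

A GPU version applies the same recurrence per thread and merges partial `(m, s)` pairs associatively across warps and blocks; this rescaling trick is also the core idea that makes Flash Attention possible.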
Performance Highlights:
- GEMM and Softmax operators achieve 1.5x the throughput of the corresponding PyTorch implementations
- Flash Attention reaches 98% of the throughput of Dao AI Lab's FlashAttention-2
- The fused Triplane Sampling operator delivers a 3x speedup over an unfused implementation
Technical Foundation:
- JIT Framework: based on DeepGEMM
- Tensor Engine: powered by NVIDIA's CuTe
- Optimization Target: primarily NVIDIA RTX 4090 GPUs
| Operator | Performance vs. Baseline |
| --- | --- |
| GEMM | 1.5x vs. PyTorch |
| Softmax | 1.5x vs. PyTorch |
| Flash Attention | 98% of FlashAttention-2 |
| Triplane Sampling | 3x improvement with fusion |
This project aims to serve as both a learning resource and a high-performance library, demonstrating how clean, well-structured code can achieve exceptional performance for critical deep learning operations.
This project draws inspiration from: