Releases: NVIDIA/TileGym
v1.0.0
What's Changed
- [Bug fix] use padding_mode inside the kernel to handle out-of-bounds elements in softmax by @xjmxyt in #1
- [Bug fix] use ct.gather/ct.store for softmax's non-TMA op by @yifeis-nv in #2
- Add PR bot to repository by @arjkesh in #3
- Update README.md by @xjmxyt in #5
- remove dead code in silu_and_mul kernel: it created 1D output offsets and expected an n_elements param, neither of which was needed, by @lessw2020 in #6
- Initialize TileGym CI by @arjkesh in #4
- Use ruff formatter, introduce helper dev script by @arjkesh in #11
- Introduce job timeouts, speed up builds by @camille-004 in #9
- [FEA] add gelu & relu by @xjmxyt in #13
- Update Dockerfile to use CUDA 13.1 base image by @arjkesh in #12
- [Fix] Refactor nightly skip logic by @arjkesh in #8
- Add automatic header checks and formatting by @arjkesh in #14
- Standardize softmax.py to avoid numpy dependency by @lessw2020 in #16
- [Update] update kernels and reformat codes by @hannahli-nv in #18
- [FEA] Add dropout by @hannahli-nv in #19
- Split-K reduction kernel cleanup by @lessw2020 in #21
- Fix: moe_align_block_size() supports non-power-of-2 num_experts by @huanghua1994 in #24
- Update autotuner: use experimental autotuner in cutile-python by @xjmxyt in #25
- feat: chunked softmax implementation for large column size by @aghilann in #17
- [Update] Add benchmark and autotune for group_gemm by @xjmxyt in #26
- Fix benchmark failure cases by @arjkesh in #27
- Format benchmark files as json, add perf thresholds by @arjkesh in #15
- feat: RMSNorm backward pass kernels by @aghilann in #29
- Split-K reduction: remove unneeded scaling via INV_LOG_2 by @lessw2020 in #22
- [fix] Update benchmark sparse checkout by @arjkesh in #30
- [FEA] Add bmm by @hannahli-nv in #31
- Temporarily avoid job failures due to inconsistent benchmarks by @arjkesh in #32
- [Update] Fix bmm issue by @hannahli-nv in #34
- [FEA] Add Qwen2-7B module by @hannahli-nv in #36
- Update for ragged_bmm moe by @hannahli-nv in #37
- Add env "DISABLE_FALLBACK" & fix type hint error & other updates by @hannahli-nv in #39
- Add reusable retry workflow for runner availability timeouts by @arjkesh in #35
- Add mHC fused kernels and tests by @Edward-lyz in #38
- Update some comments by @hannahli-nv in #42
- Add tilegym wheel building by @arjkesh in #41
- fix matmul illegal address error on DGX Spark by @xjmxyt in #44
- fix qwen2 fp16 bug by @hannahli-nv in #43
- [Fix] fix num_kv_split becoming 0 by @xjmxyt in #45
- Avoid OOM for large GEMM 32k & modify layernorm cutile by @hannahli-nv in #50
- Add option to ignore specific wheel validations by @arjkesh in #51
- Add road map by @hannahli-nv in #52
- [FEA] Add SwiGLU backward pass implementation, test cases and benchmark by @Weili-0234 in #46
- Enable experimental_kernel marker by @hannahli-nv in #53
- [FEA] Add FlashAttention backward pass implementation, test cases and benchmark by @Weili-0234 in #49
- Update README.md by @xjmxyt in #54
- Add version for tilegym wheels, update reusable workflow by @arjkesh in #55
- Fix import error for experimental marker & support gemma 3 & other updates by @hannahli-nv in #57
- Add tilegym homepage to setup.py by @arjkesh in #58
- Update MoE by @hannahli-nv in #59
- fix torch dependency by @xjmxyt in #61
- feat: replace RMSNorm backward with persistent CuTile kernel by @aghilann in #60
- Scan for CVEs in wheels, fix python versions by @arjkesh in #64
- feat: add CuTile RoPE backward with tests and backward benchmark by @aghilann in #62
- A fix for silu_and_mul & Update codes & other updates by @hannahli-nv in #67
- Add workflow to prepare release tag and artifacts by @arjkesh in #66
- Update moe type hint & Update gitignore & other updates by @hannahli-nv in #68
- add cutile kernel skill and Move install_requires dependencies to requirements.txt by @hannahli-nv in #69
- Add SECURITY.md with vulnerability reporting instructions and an SPDX license header & other updates by @hannahli-nv in #71
- feat: swiglu forward optimizations by @aghilann in #63
- feat: chunked fused linear cross-entropy kernel forward by @aghilann in #65
- Update attention & Add .venv to ruff exclude list by @hannahli-nv in #72
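Several entries above concern block/boundary handling; for example, #24 makes moe_align_block_size() work for any num_experts, not just powers of two. The core of that kind of change is rounding each expert's token count up to a multiple of the block size. A minimal Python sketch of the alignment arithmetic (the function name and signature here are illustrative, not TileGym's actual API):

```python
def align_to_block(tokens_per_expert, block_size):
    """Round each expert's token count up to a multiple of block_size.

    Hypothetical sketch of moe_align_block_size-style padding arithmetic;
    not TileGym's actual implementation.
    """
    # -(-n // b) is ceiling division, so this works for any expert count,
    # power-of-2 or not. Experts with zero tokens get zero padded slots.
    return [-(-n // block_size) * block_size for n in tokens_per_expert]

print(align_to_block([3, 5, 8], 4))   # [4, 8, 8]
print(align_to_block([1, 0, 7], 16))  # [16, 0, 16]
```

Padding each expert's slice to a block multiple lets the grouped GEMM launch uniform tiles over ragged per-expert workloads.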
New Contributors
- @xjmxyt made their first contribution in #1
- @yifeis-nv made their first contribution in #2
- @lessw2020 made their first contribution in #6
- @camille-004 made their first contribution in #9
- @hannahli-nv made their first contribution in #18
- @huanghua1994 made their first contribution in #24
- @aghilann made their first contribution in #17
- @Edward-lyz made their first contribution in #38
- @Weili-0234 made their first contribution in #46
Full Changelog: https://github.com/NVIDIA/TileGym/commits/v1.0.0