Releases: NVIDIA/TileGym

v1.0.0

11 Mar 00:21

Pre-release

What's Changed

  • [Bug fix] use padding_mode inside the kernel to process elements out of boundary for softmax by @xjmxyt in #1
  • [Bug fix] use ct.gather ct.store for softmax's no-tma op by @yifeis-nv in #2
  • Add PR bot to repository by @arjkesh in #3
  • Update README.md by @xjmxyt in #5
  • remove dead code in silu_and_mul kernel - creates output offsets (for 1D), expect n_elements param... but no need... by @lessw2020 in #6
  • Initialize TileGym CI by @arjkesh in #4
  • Use ruff formatter, introduce helper dev script by @arjkesh in #11
  • Introduce job timeouts, speed up builds by @camille-004 in #9
  • [FEA] add gelu & relu by @xjmxyt in #13
  • Update dockerfile to use cuda 13.1 base image by @arjkesh in #12
  • [Fix] Refactor nightly skip logic by @arjkesh in #8
  • Add automatic header checks and formatting by @arjkesh in #14
  • Standardize softmax.py to avoid numpy dependency by @lessw2020 in #16
  • [Update] update kernels and reformat codes by @hannahli-nv in #18
  • [FEA] Add dropout by @hannahli-nv in #19
  • Split-K reduction kernel cleanup by @lessw2020 in #21
  • Fix: moe_align_block_size() supports non-power-of-2 num_experts by @huanghua1994 in #24
  • Update autotuner: use experimental autotuner in cutile-python by @xjmxyt in #25
  • feat: chunked softmax implementation for large column size by @aghilann in #17
  • [Update] Add benchmark and autotune for group_gemm by @xjmxyt in #26
  • Fix benchmark failure cases by @arjkesh in #27
  • Format benchmark files as json, add perf thresholds by @arjkesh in #15
  • feat: RMSNorm backward pass kernels by @aghilann in #29
  • Split-K reduction: remove un-needed scaling via INV_LOG_2 by @lessw2020 in #22
  • [fix] Update benchmark sparse checkout by @arjkesh in #30
  • [FEA] Add bmm by @hannahli-nv in #31
  • Temporarily avoid job failures due to inconsistent benchmarks by @arjkesh in #32
  • [Update] Fix bmm issue by @hannahli-nv in #34
  • [FEA] Add Qwen2-7B module by @hannahli-nv in #36
  • Update for ragged_bmm moe by @hannahli-nv in #37
  • Add env "DISABLE_FALLBACK" & fix type hint error & other updates by @hannahli-nv in #39
  • Add reusable retry workflow for runner availability timeouts by @arjkesh in #35
  • Add mHC fused kernels and tests by @Edward-lyz in #38
  • Update some comments by @hannahli-nv in #42
  • Add tilegym wheel building by @arjkesh in #41
  • fix matmul illegal address error on DGX Spark by @xjmxyt in #44
  • fix qwen2 fp16 bug by @hannahli-nv in #43
  • [Fix] fix num_kv_split becoming 0 by @xjmxyt in #45
  • Avoid OOM for large GEMM 32k & modify layernorm cutile by @hannahli-nv in #50
  • Add option to ignore specific wheel validations by @arjkesh in #51
  • Add road map by @hannahli-nv in #52
  • [FEA] Add SwiGLU backward pass implementation, test cases and benchmark by @Weili-0234 in #46
  • Enable experimental_kernel marker by @hannahli-nv in #53
  • [FEA] Add FlashAttention backward pass implementation, test cases and benchmark by @Weili-0234 in #49
  • Update README.md by @xjmxyt in #54
  • Add version for tilegym wheels, update reusable workflow by @arjkesh in #55
  • Fix import error for experimental marker & support gemma 3 & other updates by @hannahli-nv in #57
  • Add tilegym homepage to setup.py by @arjkesh in #58
  • Update MoE by @hannahli-nv in #59
  • fix torch dependency by @xjmxyt in #61
  • feat: replace RMSNorm backward with persistent CuTile kernel by @aghilann in #60
  • Scan for CVEs in wheels, fix python versions by @arjkesh in #64
  • feat: add CuTile RoPE backward with tests and backward benchmark by @aghilann in #62
  • A fix for silu_and_mul & Update codes & other updates by @hannahli-nv in #67
  • Add workflow to prepare release tag and artifacts by @arjkesh in #66
  • Update moe type hint & Update gitignore & other updates by @hannahli-nv in #68
  • add cutile kernel skill and Move install_requires dependencies to requirements.txt by @hannahli-nv in #69
  • Add SECURITY.md with vulnerability reporting instructions & Add SPDX license header to SECURITY.md & other updates by @hannahli-nv in #71
  • feat: swiglu forward optimizations by @aghilann in #63
  • feat: chunked fused linear cross-entropy kernel forward by @aghilann in #65
  • Update attention & Add .venv to ruff exclude list by @hannahli-nv in #72

Full Changelog: https://github.com/NVIDIA/TileGym/commits/v1.0.0