Releases: ggml-org/llama.cpp
Release list
b8303
model : add support for Phi4ForCausalLMV (#20168)
- Add support for Phi4ForCausalLMV.
- Fix Phi-4 vision parity (correcting SigLIP2 patch-kernel export layout) and match HF NaFlex resize behavior in mtmd.
- Rename constants + fix tokenizer label
- Clean-ups.
- Fix GGUF export.
- Set tokenizer.ggml.pre explicitly.
- Default vocab name rather than forcing it.
- Clean-ups.
- Fix indent.
- Fix subscriptable error.
- Remove overcomplicated code path
- Clean-ups.
Co-authored-by: Xuan Son Nguyen son@huggingface.co
macOS/iOS:
Linux:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler:
b8301
common : fix --n-cpu-moe, --cpu-moe for models with fused gate + up (#20416)
macOS/iOS:
Linux:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler:
b8300
ggml-webgpu: Add support for GGML_OP_REPEAT (#20230)
- Add GGML_OP_REPEAT to webgpu backend.
- Add i16 support for GGML_OP_REPEAT.
macOS/iOS:
Linux:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler:
b8299
llama : enable chunked fused GDN path (#20340)
- llama : enable chunked fused GDN path
- models : avoid Q and K repeats when using fused GDA
- cont : fix comment
Co-authored-by: Aman Gupta amangupta052@gmail.com
- cont : fix the fix
Co-authored-by: Aman Gupta amangupta052@gmail.com
- cont : fix
- metal : add GDN kernel (#20361)
- metal : add Metal backend for GGML_OP_GATED_DELTA_NET
Add a fused Metal kernel for the gated delta net recurrence op
(#19504), enabling GPU-accelerated inference for DeltaNet-based
models (Qwen3.5, etc.) on Apple Silicon.
Supports both GDA (scalar gate) and KDA (per-row gate) modes
with head_size 64 and 128. Unsupported configurations (head_size
32, non-contiguous tensors) gracefully fall back to CPU.
Performance: Qwen3.5-0.8B Q4_K_M on M4 Max
tg128: 170 -> 213 t/s (+25%)
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
- metal : validate contiguity of all input tensors in supports_op
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
- metal : add algorithm equivalence comment for GDA decay path
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com
- cont : unslop + optimize
- cont : clean-up
Co-authored-by: Paul Flynn paul@arkavo.com
Co-authored-by: Claude Opus 4.6 noreply@anthropic.com
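The Metal commit above describes a fused kernel for the gated delta net recurrence but does not spell the recurrence out. As a rough illustration only, here is a minimal pure-Python sketch of one step of the scalar-gate variant, assuming the gated delta rule as usually written in the literature (decay the state, then correct it toward the new value along the key direction). The function name, the row-major `d_v x d_k` state layout, and the use of a plain multiplicative gate are assumptions made for this sketch; they are not taken from ggml.

```python
def gated_delta_step(S, q, k, v, gate, beta):
    """One step of a scalar-gate delta-rule recurrence (illustrative sketch).

    S    : state matrix, list of d_v rows with d_k entries each
    q, k : query and key vectors of length d_k
    v    : value vector of length d_v
    gate : scalar decay applied to the old state (assumed multiplicative)
    beta : scalar write strength
    Returns (new_state, output_vector).
    """
    d_v, d_k = len(S), len(k)
    new_S = []
    for i in range(d_v):
        # Decay the previous state row.
        row = [gate * S[i][j] for j in range(d_k)]
        # What the decayed state currently predicts for this key.
        pred = sum(row[j] * k[j] for j in range(d_k))
        # Delta-rule correction toward the target value v[i].
        delta = beta * (v[i] - pred)
        new_S.append([row[j] + delta * k[j] for j in range(d_k)])
    # Read out with the query.
    out = [sum(new_S[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return new_S, out


if __name__ == "__main__":
    S = [[0.0, 0.0], [0.0, 0.0]]
    S, o = gated_delta_step(S, q=[1.0, 0.0], k=[1.0, 0.0], v=[3.0, 4.0], gate=0.5, beta=1.0)
    print(S, o)
```

A per-row gate, as in the KDA mode mentioned above, would replace the scalar `gate` with a vector; which axis it indexes depends on the state layout, which these notes do not specify.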
- CUDA: AR gated delta net improvements (#20391)
- Add FastDiv to gated_delta_net_cuda
- Shard columns across warps
This reduces register pressure (avoids spill for S_v = 128) and gives
the warp-scheduler more CTAs to schedule (thus hiding data-access
latencies).
- Remove unneeded include in gated_delta_net.cu
- Improve comments
- Apply code formatting
- Make sharding HIP-compatible
- Use ggml_cuda_get_physical_warp_size() to determine warp size flexibly
- Add test with partial warp to test sum reduction on CUDA
- Remove fastdiv_s64, as we can treat neqk1 and rq3 as uint32_t
- Rename variables
- Enable GDN also for prefill, move TODO for chunked_GDN
- Actually remove the TODO from 2068908
- Get warp size at runtime
warp_size is not known at compile time in hip host code.
- Don't expose ggml_cuda_get_physical_warp_size on host
Co-authored-by: uvos devnull@uvos.xyz
- llama : refactor llm_build_delta_net_base API
Co-authored-by: Aman Gupta amangupta052@gmail.com
Co-authored-by: Paul Flynn paul@arkavo.com
Co-authored-by: Claude Opus 4.6 noreply@anthropic.com
Co-authored-by: Oliver Simons osimons@nvidia.com
Co-authored-by: uvos devnull@uvos.xyz
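The CUDA commits above mention FastDiv and note that the divisors fit in `uint32_t`. For readers unfamiliar with the idea, the following is a minimal sketch of the general multiply-and-shift technique for dividing by a fixed 32-bit divisor, written in Python so it can be checked against `//`. The helper names are chosen for illustration, and the sketch is not a copy of the kernel code; in particular, Python integers do not overflow, whereas a 32-bit implementation has to handle the intermediate sum carefully.

```python
def init_fastdiv_values(d: int):
    """Precompute (magic, shift) for dividing unsigned 32-bit values by d."""
    assert 0 < d < 2**32
    # shift = ceil(log2(d))
    shift = 0
    while (1 << shift) < d:
        shift += 1
    # Magic multiplier derived from the divisor; always fits in 32 bits.
    magic = ((1 << 32) * ((1 << shift) - d)) // d + 1
    return magic, shift


def fastdiv(n: int, magic: int, shift: int) -> int:
    """Compute n // d using one multiply, one add and one shift."""
    assert 0 <= n < 2**32
    # Upper half of the product, i.e. what a "multiply high" instruction returns.
    hi = (n * magic) >> 32
    return (hi + n) >> shift


if __name__ == "__main__":
    magic, shift = init_fastdiv_values(7)
    print(fastdiv(100, magic, shift), 100 // 7)
```

The precomputation runs once per divisor on the host; the per-element work then avoids a hardware integer division, which is the motivation for using it inside a kernel's index arithmetic.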
macOS/iOS:
Linux:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler:
b8298
llama : whitespace cleanup (#20422)
macOS/iOS:
Linux:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler:
b8297
ggml : add NVFP4 quantization type support (#19769)
- WIP: add NVFP4 quantization support
- tests
- improve NVFP4 dot product implementation performance and fix bad super call
- typo
- Use nvfp4 kvalues
- vulkan : fix NVFP4 shader compilation by including kvalues_mxfp4 lookup table
- vulkan and perf fixes
- wip
- Fix metal
- fix vulkan
- Rename threshold & fix wrong scale
- Fix MOE
- Shelve backend implementations (CUDA, Metal, Vulkan, arch-specific SIMD)
Remove NVFP4 support from GPU backends and architecture-specific
optimized dot products. These should be added in separate PRs so
backend specialists can review them independently.
Reverted files:
- ggml-cuda: common.cuh, convert.cu, mmq.cu/cuh, mmvq.cu, vecdotq.cuh, quantize.cu/cuh, mma.cuh, ggml-cuda.cu, fattn-tile.cuh
- ggml-metal: ggml-metal.metal, ggml-metal-device.cpp, ggml-metal-impl.h, ggml-metal-ops.cpp
- ggml-vulkan: ggml-vulkan.cpp, all vulkan-shaders/*
- ggml-cpu arch: arm/quants.c, x86/quants.c, powerpc/quants.c, s390/quants.c
Core NVFP4 support (type definition, CPU fallback dot product,
quantization, dequantization, conversion) is retained.
- Fix arch-fallback.h: add NVFP4 generic fallback for all platforms
After shelving backend-specific SIMD implementations, the generic
CPU dot product needs to be aliased on ARM, x86, PowerPC, and s390
platforms that previously relied on arch-specific versions.
- quantize: add NVFP4 as a quantization type option
- Fix ggml_fp32_to_ue4m3: handle subnormal values
Previously, values with ue4m3_exp <= 0 were clamped to 0, causing
all small scales to underflow. This made NVFP4 quantization via
llama-quantize produce garbage (PPL = 5.8M) since typical transformer
weights have amax/6.0 in the range 0.001-0.01, which falls in the
UE4M3 subnormal range.
Now subnormals are properly encoded as man * 2^-9 (exp=0, man=1..7),
matching the decode path in ggml_ue4m3_to_fp32.
Result: NVFP4 requantization now produces PPL = 15.25 (vs F16 = 14.33),
comparable to Q4_1 (PPL = 15.81) at slightly lower BPW (4.70 vs 5.15).
- Restore ARM NEON NVFP4 dot product implementation
Restores the optimized ggml_vec_dot_nvfp4_q8_0 for ARM NEON using
vqtbl1q_s8 lookup and ggml_vdotq_s32 dot products.
tg128 performance: 4.37 t/s (generic) -> 13.66 t/s (NEON) = 3.1x speedup
- Optimize ARM NEON NVFP4 dot product: LUT + vpaddq + vfmaq
- Add ue4m3_scale_lut[128] to ggml-common.h replacing branch-heavy ggml_ue4m3_to_fp32() in the hot loop
- Use vpaddq_s32 for pairwise int32 reduction instead of vaddvq_s32
- Accumulate with vfmaq_f32 into float32x4_t vector accumulators
tg128: 8.1 -> 31.0 t/s (3.8x speedup, 77% of Q4_1 speed)
- ARM NEON NVFP4: rearrange q8 to match nibble layout
Alternative approach: rearrange q8 data to match the NVFP4 lo/hi
nibble layout instead of rearranging the looked-up NVFP4 values.
Eliminates vcombine_s8(vget_low, vget_low) shuffles.
Performance is equivalent (~18.5 t/s) - the bottleneck is the 2x
block overhead from QK=16 vs QK=32, not the shuffle instructions.
- CPU only backend 64 super-block layout
- cleanup
- Remove unused LUT
- int
- exclude NVFP4 from unsupported ops in metal build
- remove quantization for now
- store scales as native UE4M3, preserve original model bits when possible
- Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- correct comment
- format
- reduce duplication and cleanup
- Address comments
- move detection to prepare_tensors
- Use math instead of const
- Move
- fix comment
- Shelve quantize tests
- Rebase and move check
- cleanup
- lint
- Update gguf-py/gguf/scripts/gguf_convert_endian.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Use fallback quant config
- Simplify
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- organize
- Refactor
- Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- add quantize_nvfp4 (required for test_quants.py)
- add quantize_nvfp4 (required for test_quants.py)
- add quantize_nvfp4 (required for test_quants.py)
- fix return type
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
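The ggml_fp32_to_ue4m3 fix above is easier to follow with the encoding written out. The sketch below assumes UE4M3 is an unsigned 7-bit code with 4 exponent bits, 3 mantissa bits and exponent bias 7, which is consistent with the `man * 2^-9` subnormal formula quoted in the notes. The Python function names, the brute-force nearest-value encoder, and the choice to cap at 448 (code 0x7E) are assumptions of this sketch, not the ggml implementation.

```python
def ue4m3_to_fp32(code: int) -> float:
    """Decode a 7-bit UE4M3 code (assumed layout: eeee mmm, bias 7)."""
    exp = (code >> 3) & 0xF
    man = code & 0x7
    if exp == 0:
        # Subnormal: no implicit leading 1, value = man * 2^-9.
        return man * 2.0 ** -9
    # Normal: implicit leading 1.
    return (1.0 + man / 8.0) * 2.0 ** (exp - 7)


def fp32_to_ue4m3(x: float) -> int:
    """Encode to the nearest representable code, subnormals included."""
    if not x > 0.0:
        return 0
    x = min(x, 448.0)  # assumed largest finite value (code 0x7E)
    # Brute force keeps the sketch obviously correct; real code would
    # compute exponent and mantissa directly.
    return min(range(0x7F), key=lambda c: abs(ue4m3_to_fp32(c) - x))


if __name__ == "__main__":
    scale = 0.005  # within the 0.001-0.01 range of amax/6.0 cited above
    code = fp32_to_ue4m3(scale)
    print(code, ue4m3_to_fp32(code))
```

Under these assumptions the smallest normal value is 2^-6 = 0.015625, so every scale in the 0.001 to 0.01 range cited in the notes lands in the subnormal range. An encoder that clamps subnormals to 0 would therefore zero out those block scales, which matches the failure the commit describes.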
macOS/iOS:
Linux:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler:
b8295
llama : add support for Nemotron 3 Super (#20411)
- llama : add support for Nemotron 3 Super
This commit adds support for the Nemotron 3 Super model (120B.A12B)
enabling this model to be converted to GGUF format and run in llama.cpp.
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
Co-authored-by: Matt Clayton 156335168+mattjcly@users.noreply.github.com
macOS/iOS:
Linux:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler:
b8292
metal : fix q5_k mul_mv register spill (#20399)
macOS/iOS:
Linux:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler:
b8291
metal : add env var to trigger graph capture (#20398)
macOS/iOS:
Linux:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler:
b8287
common/parser: handle reasoning budget (#20297)
- v1
- Finished!
- Handle CLI
- Reasoning sampler
- Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Less explosive terminology :)
- Add UTF-8 case and tests
- common : migrate reasoning budget sampler to common
- cont : clean up
- cont : expose state and allow passing as initial state
- cont : remove unused imports
- cont : update state machine doc string
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
Co-authored-by: Alde Rojas hello@alde.dev
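The commits above name a reasoning budget sampler driven by a state machine, with state that can be exposed and passed back in, but they do not describe its behavior. Purely as an illustration of the general idea, the sketch below tracks how many tokens have been spent inside a reasoning block and reports when the closing token should be forced. Every name, state, and token id here is hypothetical; none of it is taken from llama.cpp's API, and the UTF-8 handling mentioned in the notes is not modeled.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()      # not inside a reasoning block yet
    COUNTING = auto()  # inside the block, spending budget
    FORCING = auto()   # budget exhausted, the end token must come next
    DONE = auto()      # reasoning block closed


class ReasoningBudget:
    """Hypothetical token-budget tracker for a reasoning block."""

    def __init__(self, start_id, end_id, budget, state=State.IDLE):
        self.start_id = start_id
        self.end_id = end_id
        self.remaining = budget
        self.state = state  # an initial state can be passed in to resume

    def forced_token(self):
        """Token the sampler should force next, or None for free sampling."""
        return self.end_id if self.state is State.FORCING else None

    def accept(self, token):
        """Advance the state machine with the token that was just emitted."""
        if self.state is State.IDLE:
            if token == self.start_id:
                self.state = State.COUNTING if self.remaining > 0 else State.FORCING
        elif self.state is State.COUNTING:
            if token == self.end_id:
                self.state = State.DONE
            else:
                self.remaining -= 1
                if self.remaining <= 0:
                    self.state = State.FORCING
        elif self.state is State.FORCING:
            if token == self.end_id:
                self.state = State.DONE


if __name__ == "__main__":
    rb = ReasoningBudget(start_id=1, end_id=2, budget=2)
    for tok in [5, 1, 7, 8]:
        rb.accept(tok)
    print(rb.state, rb.forced_token())
```

In a real sampler the forced token would be applied by masking the logits of every other token; this sketch only reports which token to force.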
macOS/iOS:
Linux:
Windows:
- Windows x64 (CPU)
- Windows arm64 (CPU)
- Windows x64 (CUDA 12) - CUDA 12.4 DLLs
- Windows x64 (CUDA 13) - CUDA 13.1 DLLs
- Windows x64 (Vulkan)
- Windows x64 (SYCL)
- Windows x64 (HIP)
openEuler: