Releases · ggml-org/llama.cpp

Release list

b8303

github-actions released this 13 Mar 07:20

b8303

fdb1764

model : add support for Phi4ForCausalLMV (#20168)

Add support for Phi4ForCausalLMV.
Fix Phi-4 vision parity by correcting the SigLIP2 patch-kernel export layout and matching HF NaFlex resize behavior in mtmd.
Rename constants + fix tokenizer label
Clean-ups.
Fix GGUF export.
Set tokenizer.ggml.pre explicitly.
Default vocab name rather than forcing it.
Clean-ups.
Fix indent.
Fix subscriptable error.
Remove overcomplicated code path
Clean-ups.

Co-authored-by: Xuan Son Nguyen son@huggingface.co


b8301

github-actions released this 13 Mar 07:04

b8301

4a748b8

common : fix --n-cpu-moe, --cpu-moe for models with fused gate + up (#20416)


b8300

github-actions released this 13 Mar 03:26

b8300

f2ab047

ggml-webgpu: Add support for GGML_OP_REPEAT (#20230)

Add GGML_OP_REPEAT to webgpu backend.
Add i16 support for GGML_OP_REPEAT.


b8299

github-actions released this 13 Mar 02:42

b8299

d28961d

llama : enable chunked fused GDN path (#20340)

llama : enable chunked fused GDN path
models : avoid Q and K repeats when using fused GDA
cont : fix comment

Co-authored-by: Aman Gupta amangupta052@gmail.com

cont : fix the fix

Co-authored-by: Aman Gupta amangupta052@gmail.com

cont : fix
metal : add GDN kernel (#20361)
metal : add Metal backend for GGML_OP_GATED_DELTA_NET

Add a fused Metal kernel for the gated delta net recurrence op
(#19504), enabling GPU-accelerated inference for DeltaNet-based
models (Qwen3.5, etc.) on Apple Silicon.

Supports both GDA (scalar gate) and KDA (per-row gate) modes
with head_size 64 and 128. Unsupported configurations (head_size
32, non-contiguous tensors) gracefully fall back to CPU.

Performance: Qwen3.5-0.8B Q4_K_M on M4 Max
tg128: 170 -> 213 t/s (+25%)

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com

metal : validate contiguity of all input tensors in supports_op

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com

metal : add algorithm equivalence comment for GDA decay path

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com

cont : unslop + optimize
cont : clean-up

Co-authored-by: Paul Flynn paul@arkavo.com
Co-authored-by: Claude Opus 4.6 noreply@anthropic.com
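
For readers unfamiliar with the op, the gated delta net recurrence mentioned above can be sketched in a few lines of plain Python. This is an illustration of the general gated delta rule only, not the Metal kernel or the ggml op definition: the function name `gated_delta_step`, the state layout (one row per key channel), and the argument order are all assumptions made for the example. The scalar gate corresponds to what the notes call GDA mode, and the list-valued gate to the per-row KDA mode.

```python
def gated_delta_step(S, q, k, v, beta, gate):
    """One step of a gated delta rule (illustrative sketch, hypothetical API).

    S    : state, a list of d_k rows with d_v entries each
    q, k : query and key vectors of length d_k
    v    : value vector of length d_v
    beta : scalar write strength
    gate : float (scalar gate) or list of d_k floats (per-row gate)
    """
    d_k, d_v = len(k), len(v)
    gates = [gate] * d_k if isinstance(gate, (int, float)) else list(gate)

    # 1. Decay the previous state, row by row.
    S = [[gates[i] * S[i][j] for j in range(d_v)] for i in range(d_k)]

    # 2. What the decayed state currently predicts for this key.
    pred = [sum(k[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]

    # 3. Delta-rule update: move the prediction toward v by a factor beta.
    S = [[S[i][j] + beta * k[i] * (v[j] - pred[j]) for j in range(d_v)]
         for i in range(d_k)]

    # 4. Read out with the query.
    out = [sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]
    return S, out


if __name__ == "__main__":
    state = [[0.0, 0.0], [0.0, 0.0]]
    state, out = gated_delta_step(state, [1.0, 0.0], [1.0, 0.0], [2.0, 3.0], 1.0, 1.0)
    print(out)
```

Because each step depends on the previous state, the recurrence is sequential over tokens, which is why a fused kernel per step (or per chunk) matters for token-generation speed.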

CUDA: AR gated delta net improvements (#20391)
Add FastDiv to gated_delta_net_cuda
Shard columns across warps

This reduces register pressure (avoids spill for S_v = 128) and gives
the warp-scheduler more CTAs to schedule (thus hiding data-access
latencies).

Remove unneeded include in gated_delta_net.cu
Improve comments
Apply code formatting
Make sharding HIP-compatible

Use ggml_cuda_get_physical_warp_size() to determine warp size flexibly
Add test with partial warp to test sum reduction on CUDA

Remove fastdiv_s64, as we can treat neqk1 and rq3 as uint32_t
Rename variables
Enable GDN also for prefill, move TODO for chunked_GDN
Actually remove the TODO from 2068908
Get warp size at runtime

warp_size is not known at compile time in hip host code.

Don't expose ggml_cuda_get_physical_warp_size on host

Co-authored-by: uvos devnull@uvos.xyz
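
The "FastDiv" item in the CUDA commit above refers to replacing integer division by a loop-invariant divisor with a multiply and a shift. The sketch below shows that general technique in Python under stated assumptions: the names `init_fastdiv` and `fastdiv` are hypothetical and are not claimed to match the helpers in ggml-cuda, and Python's unbounded integers hide the 32-bit overflow concerns a device implementation has to handle.

```python
def init_fastdiv(d: int) -> tuple[int, int]:
    """Precompute (multiplier, shift) for dividing unsigned 32-bit values by d."""
    if not 0 < d < 2**32:
        raise ValueError("divisor must be a non-zero unsigned 32-bit integer")
    shift = (d - 1).bit_length()  # ceil(log2(d))
    mult = ((1 << 32) * ((1 << shift) - d)) // d + 1
    return mult, shift


def fastdiv(n: int, mult: int, shift: int) -> int:
    """Compute n // d using the values precomputed by init_fastdiv(d)."""
    hi = (n * mult) >> 32  # high 32 bits of the 64-bit product
    return (hi + n) >> shift


if __name__ == "__main__":
    mult, shift = init_fastdiv(7)
    print(fastdiv(100, mult, shift), 100 // 7)
```

The precomputation happens once on the host; each division inside the kernel then costs one high multiply, one add, and one shift instead of a hardware divide.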

llama : refactor llm_build_delta_net_base API

Co-authored-by: Aman Gupta amangupta052@gmail.com
Co-authored-by: Paul Flynn paul@arkavo.com
Co-authored-by: Claude Opus 4.6 noreply@anthropic.com
Co-authored-by: Oliver Simons osimons@nvidia.com
Co-authored-by: uvos devnull@uvos.xyz


b8298

github-actions released this 13 Mar 00:24

b8298

f90bd1d

llama : whitespace cleanup (#20422)


b8297

github-actions released this 13 Mar 00:17

b8297

5eae9cb

ggml : add NVFP4 quantization type support (#19769)

WIP: add NVFP4 quantization support
tests
improve NVFP4 dot product implementation performance and fix bad super call
typo
Use nvfp4 kvalues
vulkan : fix NVFP4 shader compilation by including kvalues_mxfp4 lookup table
vulkan and perf fixes
wip
Fix metal
fix vulkan
Rename threshold & fix wrong scale
Fix MOE
Shelve backend implementations (CUDA, Metal, Vulkan, arch-specific SIMD)

Remove NVFP4 support from GPU backends and architecture-specific
optimized dot products. These should be added in separate PRs so
backend specialists can review them independently.

Reverted files:

ggml-cuda: common.cuh, convert.cu, mmq.cu/cuh, mmvq.cu, vecdotq.cuh,
quantize.cu/cuh, mma.cuh, ggml-cuda.cu, fattn-tile.cuh
ggml-metal: ggml-metal.metal, ggml-metal-device.cpp, ggml-metal-impl.h,
ggml-metal-ops.cpp
ggml-vulkan: ggml-vulkan.cpp, all vulkan-shaders/*
ggml-cpu arch: arm/quants.c, x86/quants.c, powerpc/quants.c, s390/quants.c

Core NVFP4 support (type definition, CPU fallback dot product,
quantization, dequantization, conversion) is retained.

Fix arch-fallback.h: add NVFP4 generic fallback for all platforms

After shelving backend-specific SIMD implementations, the generic
CPU dot product needs to be aliased on ARM, x86, PowerPC, and s390
platforms that previously relied on arch-specific versions.

quantize: add NVFP4 as a quantization type option
Fix ggml_fp32_to_ue4m3: handle subnormal values

Previously, values with ue4m3_exp <= 0 were clamped to 0, causing
all small scales to underflow. This made NVFP4 quantization via
llama-quantize produce garbage (PPL = 5.8M) since typical transformer
weights have amax/6.0 in the range 0.001-0.01, which falls in the
UE4M3 subnormal range.

Now subnormals are properly encoded as man * 2^-9 (exp=0, man=1..7),
matching the decode path in ggml_ue4m3_to_fp32.

Result: NVFP4 requantization now produces PPL = 15.25 (vs F16 = 14.33),
comparable to Q4_1 (PPL = 15.81) at slightly lower BPW (4.70 vs 5.15).
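
The subnormal fix described above can be illustrated with a small Python model of an unsigned 4-bit-exponent, 3-bit-mantissa scale format. This is a sketch, not the ggml code: it reuses the names `ue4m3_to_fp32` and `fp32_to_ue4m3` only for readability, assumes an exponent bias of 7 (consistent with the `man * 2^-9` subnormal formula quoted above), assumes code 0x7F is reserved as in the common E4M3 convention, and encodes by brute-force nearest-value search rather than bit manipulation.

```python
def ue4m3_to_fp32(code: int) -> float:
    """Decode a 7-bit UE4M3 code (4 exponent bits, 3 mantissa bits, bias 7 assumed)."""
    exp = (code >> 3) & 0xF
    man = code & 0x7
    if exp == 0:
        # Subnormal range: man * 2^-9, as stated in the commit message.
        return man * 2.0 ** -9
    return (1.0 + man / 8.0) * 2.0 ** (exp - 7)


def fp32_to_ue4m3(x: float) -> int:
    """Encode by picking the nearest representable value (simple, not fast)."""
    if not x > 0.0:
        return 0
    # Assumption: 0x7F is reserved, so only codes 0x00..0x7E are candidates.
    return min(range(0x7F), key=lambda c: abs(ue4m3_to_fp32(c) - x))


if __name__ == "__main__":
    for scale in (0.001, 0.005, 0.01):
        code = fp32_to_ue4m3(scale)
        print(scale, code, ue4m3_to_fp32(code))
```

With subnormals handled, scales in the 0.001-0.01 range map to non-zero codes instead of collapsing to 0, which is the failure mode the commit message describes.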

Restore ARM NEON NVFP4 dot product implementation

Restores the optimized ggml_vec_dot_nvfp4_q8_0 for ARM NEON using
vqtbl1q_s8 lookup and ggml_vdotq_s32 dot products.

tg128 performance: 4.37 t/s (generic) -> 13.66 t/s (NEON) = 3.1x speedup

Optimize ARM NEON NVFP4 dot product: LUT + vpaddq + vfmaq

Add ue4m3_scale_lut[128] to ggml-common.h replacing branch-heavy
ggml_ue4m3_to_fp32() in the hot loop
Use vpaddq_s32 for pairwise int32 reduction instead of vaddvq_s32
Accumulate with vfmaq_f32 into float32x4_t vector accumulators

tg128: 8.1 -> 31.0 t/s (3.8x speedup, 77% of Q4_1 speed)

ARM NEON NVFP4: rearrange q8 to match nibble layout

Alternative approach: rearrange q8 data to match the NVFP4 lo/hi
nibble layout instead of rearranging the looked-up NVFP4 values.
Eliminates vcombine_s8(vget_low, vget_low) shuffles.

Performance is equivalent (~18.5 t/s) - the bottleneck is the 2x
block overhead from QK=16 vs QK=32, not the shuffle instructions.

CPU only backend 64 super-block layout
cleanup
Remove unused LUT
int
exclude NVFP4 from unsupported ops in metal build
remove quantization for now
store scales as native UE4M3, preserve original model bits when possible
Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

correct comment
format
reduce duplication and cleanup
Address comments
move detection to prepare_tensors
Use math instead of const
Move
fix comment
Shelve quantize tests
Rebase and move check
cleanup
lint
Update gguf-py/gguf/scripts/gguf_convert_endian.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Use fallback quant config
Simplify

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

organize
Refactor
Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

add quantize_nvfp4 (required for test_quants.py)
add quantize_nvfp4 (required for test_quants.py)
add quantize_nvfp4 (required for test_quants.py)
fix return type

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com


b8295

github-actions released this 12 Mar 23:56

b8295

eaf1d79

llama : add support for Nemotron 3 Super (#20411)

llama : add support for Nemotron 3 Super

This commit adds support for the Nemotron 3 Super model (120B.A12B)
enabling this model to be converted to GGUF format and run in llama.cpp.

Co-authored-by: Georgi Gerganov ggerganov@gmail.com
Co-authored-by: Matt Clayton 156335168+mattjcly@users.noreply.github.com


b8292

github-actions released this 12 Mar 17:48

b8292

b541241

metal : fix q5_k mul_mv register spill (#20399)


b8291

github-actions released this 12 Mar 16:17

b8291

c363256

metal : add env var to trigger graph capture (#20398)


b8287

github-actions released this 12 Mar 09:21

b8287

acb7c79

common/parser: handle reasoning budget (#20297)

v1
Finished!
Handle CLI
Reasoning sampler
Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Less explosive terminology :)
Add utf-8 case and tests
common : migrate reasoning budget sampler to common
cont : clean up
cont : expose state and allow passing as initial state
cont : remove unused imports
cont : update state machine doc string

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
Co-authored-by: Alde Rojas hello@alde.dev

