Releases: ggml-org/llama.cpp

b8303

github-actions released this 13 Mar 07:20
fdb1764

model : add support for Phi4ForCausalLMV (#20168)

  • Add support for Phi4ForCausalLMV.

  • Fix Phi-4 vision parity by correcting the SigLIP2 patch-kernel export layout and matching HF NaFlex resize behavior in mtmd.

  • Rename constants + fix tokenizer label

  • Clean-ups.

  • Fix GGUF export.

  • Set tokenizer.ggml.pre explicitly.

  • Default vocab name rather than forcing it.

  • Clean-ups.

  • Fix indent.

  • Fix subscriptable error.

  • Remove overcomplicated code path

  • Clean-ups.


Co-authored-by: Xuan Son Nguyen son@huggingface.co

b8301

github-actions released this 13 Mar 07:04
4a748b8

b8300

github-actions released this 13 Mar 03:26
f2ab047

b8299

github-actions released this 13 Mar 02:42
d28961d

llama : enable chunked fused GDN path (#20340)

  • llama : enable chunked fused GDN path

  • models : avoid Q and K repeats when using fused GDA

  • cont : fix comment

Co-authored-by: Aman Gupta amangupta052@gmail.com

  • cont : fix the fix

Co-authored-by: Aman Gupta amangupta052@gmail.com

  • cont : fix

  • metal : add GDN kernel (#20361)

  • metal : add Metal backend for GGML_OP_GATED_DELTA_NET

Add a fused Metal kernel for the gated delta net recurrence op
(#19504), enabling GPU-accelerated inference for DeltaNet-based
models (Qwen3.5, etc.) on Apple Silicon.

Supports both GDA (scalar gate) and KDA (per-row gate) modes
with head_size 64 and 128. Unsupported configurations (head_size
32, non-contiguous tensors) gracefully fall back to CPU.

Performance: Qwen3.5-0.8B Q4_K_M on M4 Max
tg128: 170 -> 213 t/s (+25%)

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com

  • metal : validate contiguity of all input tensors in supports_op

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com

  • metal : add algorithm equivalence comment for GDA decay path

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com

  • cont : unslop + optimize

  • cont : clean-up


Co-authored-by: Paul Flynn paul@arkavo.com
Co-authored-by: Claude Opus 4.6 noreply@anthropic.com

  • CUDA: AR gated delta net improvements (#20391)

  • Add FastDiv to gated_delta_net_cuda

  • Shard columns across warps

This reduces register pressure (avoids spill for S_v = 128) and gives
the warp-scheduler more CTAs to schedule (thus hiding data-access
latencies).

  • Remove unneeded include in gated_delta_net.cu

  • Improve comments

  • Apply code-formatting

  • Make sharding HIP-compatible

  1. Use ggml_cuda_get_physical_warp_size() to determine warp size flexibly
  2. Add test with partial warp to test sum reduction on CUDA

  • Remove fastdiv_s64, as we can treat neqk1 and rq3 as uint32_t

  • Rename variables

  • Enable GDN also for prefill, move TODO for chunked_GDN

  • Actually remove the TODO from 2068908

  • Get warp size at runtime

warp_size is not known at compile time in HIP host code.

  • Don't expose ggml_cuda_get_physical_warp_size on host

Co-authored-by: uvos devnull@uvos.xyz
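
The FastDiv and fastdiv_s64 items in the CUDA change above concern replacing integer division by a divisor that stays fixed for a whole kernel launch with a multiply and a shift. The sketch below illustrates that general technique for 32-bit unsigned operands. It is not the ggml CUDA code: the names make_magic and div_by_magic are hypothetical, and the exact constants used in gated_delta_net.cu are not shown in these notes.

```python
def make_magic(d: int) -> tuple[int, int]:
    """Precompute (multiplier, shift) for dividing unsigned 32-bit ints by d.

    Hypothetical helper for illustration; done once per divisor.
    """
    if not 0 < d < 2**32:
        raise ValueError("divisor must be a non-zero unsigned 32-bit integer")
    # shift = ceil(log2(d)), i.e. the smallest L with 2**L >= d
    shift = (d - 1).bit_length()
    multiplier = (2**32 * (2**shift - d)) // d + 1
    return multiplier, shift


def div_by_magic(n: int, magic: tuple[int, int]) -> int:
    """Return n // d using one high multiply, one add and one shift."""
    multiplier, shift = magic
    hi = (n * multiplier) >> 32  # upper 32 bits of the 64-bit product
    # Python integers never overflow; a fixed-width implementation has to
    # make sure that hi + n still fits in the register it uses.
    return (hi + n) >> shift


if __name__ == "__main__":
    magic = make_magic(128)
    print(div_by_magic(1000, magic))  # same value as 1000 // 128
```

The precomputation runs once per divisor, so the per-element cost in the inner loop is a multiply-high, an add and a shift instead of a hardware divide.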

  • llama : refactor llm_build_delta_net_base API

Co-authored-by: Aman Gupta amangupta052@gmail.com
Co-authored-by: Paul Flynn paul@arkavo.com
Co-authored-by: Claude Opus 4.6 noreply@anthropic.com
Co-authored-by: Oliver Simons osimons@nvidia.com
Co-authored-by: uvos devnull@uvos.xyz

b8298

github-actions released this 13 Mar 00:24
f90bd1d

b8297

github-actions released this 13 Mar 00:17
5eae9cb

ggml : add NVFP4 quantization type support (#19769)

  • WIP: add NVFP4 quantization support

  • tests

  • improve NVFP4 dot product implementation performance and fix bad super call

  • typo

  • Use nvfp4 kvalues

  • vulkan : fix NVFP4 shader compilation by including kvalues_mxfp4 lookup table

  • vulkan and perf fixes

  • wip

  • Fix metal

  • fix vulkan

  • Rename threshold & fix wrong scale

  • Fix MOE

  • Shelve backend implementations (CUDA, Metal, Vulkan, arch-specific SIMD)

Remove NVFP4 support from GPU backends and architecture-specific
optimized dot products. These should be added in separate PRs so
backend specialists can review them independently.

Reverted files:

  • ggml-cuda: common.cuh, convert.cu, mmq.cu/cuh, mmvq.cu, vecdotq.cuh,
    quantize.cu/cuh, mma.cuh, ggml-cuda.cu, fattn-tile.cuh
  • ggml-metal: ggml-metal.metal, ggml-metal-device.cpp, ggml-metal-impl.h,
    ggml-metal-ops.cpp
  • ggml-vulkan: ggml-vulkan.cpp, all vulkan-shaders/*
  • ggml-cpu arch: arm/quants.c, x86/quants.c, powerpc/quants.c, s390/quants.c

Core NVFP4 support (type definition, CPU fallback dot product,
quantization, dequantization, conversion) is retained.

  • Fix arch-fallback.h: add NVFP4 generic fallback for all platforms

After shelving backend-specific SIMD implementations, the generic
CPU dot product needs to be aliased on ARM, x86, PowerPC, and s390
platforms that previously relied on arch-specific versions.

  • quantize: add NVFP4 as a quantization type option

  • Fix ggml_fp32_to_ue4m3: handle subnormal values

Previously, values with ue4m3_exp <= 0 were clamped to 0, causing
all small scales to underflow. This made NVFP4 quantization via
llama-quantize produce garbage (PPL = 5.8M) since typical transformer
weights have amax/6.0 in the range 0.001-0.01, which falls in the
UE4M3 subnormal range.

Now subnormals are properly encoded as man * 2^-9 (exp=0, man=1..7),
matching the decode path in ggml_ue4m3_to_fp32.

Result: NVFP4 requantization now produces PPL = 15.25 (vs F16 = 14.33),
comparable to Q4_1 (PPL = 15.81) at slightly lower BPW (4.70 vs 5.15).
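
The subnormal fix described above can be illustrated with a small encoder/decoder pair. This is a sketch only, not the ggml implementation. It assumes UE4M3 is an unsigned 8-bit format with 4 exponent bits, 3 mantissa bits and an exponent bias of 7 (which is consistent with the man * 2^-9 subnormal rule quoted above), and it assumes that large values saturate at 448 with the all-ones code left unused. The function names are hypothetical.

```python
import math


def ue4m3_to_fp32(code: int) -> float:
    """Decode an assumed UE4M3 byte: 4 exponent bits, 3 mantissa bits, bias 7."""
    exp = (code >> 3) & 0xF
    man = code & 0x7
    if exp == 0:
        return man * 2.0 ** -9  # subnormal: man * 2^-9
    return (1.0 + man / 8.0) * 2.0 ** (exp - 7)


def fp32_to_ue4m3(x: float) -> int:
    """Encode a non-negative float, keeping small values as subnormals."""
    if not x > 0.0:
        return 0  # zero, negative or NaN input maps to code 0 in this sketch
    if x >= 448.0:
        return 0x7E  # assumed saturation at the largest finite value
    m, e = math.frexp(x)  # x = m * 2**e with 0.5 <= m < 1
    exp = e - 1 + 7       # biased exponent of the 1.xxx * 2**(e-1) form
    if exp <= 0:
        # Subnormal range: round to the nearest multiple of 2^-9 instead of
        # clamping to zero, which is the behaviour the fix above replaces.
        man = round(x * 512.0)
        return 0x08 if man >= 8 else man  # man == 8 is the smallest normal
    man = round((2.0 * m - 1.0) * 8.0)
    if man == 8:          # mantissa rounded up into the next binade
        exp, man = exp + 1, 0
    return (exp << 3) | man


if __name__ == "__main__":
    code = fp32_to_ue4m3(0.005)  # a scale in the 0.001-0.01 range cited above
    print(code, ue4m3_to_fp32(code))
```

Under these assumptions a scale of 0.005 is stored as a subnormal with man = 3 rather than being flushed to zero, so the block it belongs to keeps a usable scale.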

  • Restore ARM NEON NVFP4 dot product implementation

Restores the optimized ggml_vec_dot_nvfp4_q8_0 for ARM NEON using
vqtbl1q_s8 lookup and ggml_vdotq_s32 dot products.

tg128 performance: 4.37 t/s (generic) -> 13.66 t/s (NEON) = 3.1x speedup

  • Optimize ARM NEON NVFP4 dot product: LUT + vpaddq + vfmaq

      • Add ue4m3_scale_lut[128] to ggml-common.h, replacing the branch-heavy
        ggml_ue4m3_to_fp32() in the hot loop
      • Use vpaddq_s32 for pairwise int32 reduction instead of vaddvq_s32
      • Accumulate with vfmaq_f32 into float32x4_t vector accumulators

tg128: 8.1 -> 31.0 t/s (3.8x speedup, 77% of Q4_1 speed)
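
The lookup-table idea in the commit above can be sketched as follows. A scale byte whose top bit is unused has only 128 possible values, so every decoded scale can be computed once and the hot loop reduced to an index operation. The sketch assumes the UE4M3 layout implied by the subnormal rule quoted earlier in these notes (exponent bias 7, man * 2^-9 when exp = 0) and does not model any special handling of reserved codes. The Python names are hypothetical and only mirror the idea behind ue4m3_scale_lut[128].

```python
def decode_ue4m3(code: int) -> float:
    """Branching decoder for an assumed UE4M3 byte (bias 7, 3 mantissa bits)."""
    exp = (code >> 3) & 0xF
    man = code & 0x7
    if exp == 0:
        return man * 2.0 ** -9
    return (1.0 + man / 8.0) * 2.0 ** (exp - 7)


# Built once up front: one entry per possible 7-bit scale code.
SCALE_LUT = tuple(decode_ue4m3(code) for code in range(128))


def block_scales(codes: bytes) -> list[float]:
    """Decode a run of per-block scale bytes with table lookups only."""
    return [SCALE_LUT[c & 0x7F] for c in codes]


if __name__ == "__main__":
    print(block_scales(bytes([0, 3, 56])))
```

The table trades 128 floats of storage for removing the exponent test from the inner loop.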

  • ARM NEON NVFP4: rearrange q8 to match nibble layout

Alternative approach: rearrange q8 data to match the NVFP4 lo/hi
nibble layout instead of rearranging the looked-up NVFP4 values.
Eliminates vcombine_s8(vget_low, vget_low) shuffles.

Performance is equivalent (~18.5 t/s) - the bottleneck is the 2x
block overhead from QK=16 vs QK=32, not the shuffle instructions.

  • CPU only backend 64 super-block layout

  • cleanup

  • Remove unused LUT

  • int

  • exclude NVFP4 from unsupported ops in metal build

  • remove quantization for now

  • store scales as native UE4M3, preserve original model bits when possible

  • Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

  • correct comment

  • format

  • reduce duplication and cleanup

  • Address comments

  • move detection to prepare_tensors

  • Use math instead of const

  • Move

  • fix comment

  • Shelve quantize tests

  • Rebase and move check

  • cleanup

  • lint

  • Update gguf-py/gguf/scripts/gguf_convert_endian.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

  • Use fallback quant config

  • Simplify

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

  • organize

  • Refactor

  • Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

  • Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

  • Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

  • add quantize_nvfp4 (required for test_quants.py)

  • fix return type


Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

b8295

github-actions released this 12 Mar 23:56
eaf1d79

llama : add support for Nemotron 3 Super (#20411)

  • llama : add support for Nemotron 3 Super

This commit adds support for the Nemotron 3 Super model (120B.A12B),
enabling it to be converted to GGUF format and run in llama.cpp.

Co-authored-by: Georgi Gerganov ggerganov@gmail.com
Co-authored-by: Matt Clayton 156335168+mattjcly@users.noreply.github.com

b8292

github-actions released this 12 Mar 17:48
b541241

b8291

github-actions released this 12 Mar 16:17
c363256

b8287

github-actions released this 12 Mar 09:21
acb7c79

common/parser: handle reasoning budget (#20297)

  • v1

  • Finished!

  • Handle CLI

  • Reasoning sampler

  • Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

  • Less explosive terminology :)

  • Add utf-8 case and tests

  • common : migrate reasoning budget sampler to common

  • cont : clean up

  • cont : expose state and allow passing as initial state

  • cont : remove unused imports

  • cont : update state machine doc string


Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
Co-authored-by: Alde Rojas hello@alde.dev
