Releases · ggml-org/llama.cpp

Release list

b9873 Latest

Latest

github-actions released this 04 Jul 21:09

a410713

llama : add guard for K/V rotation input when buffer is unallocated (#25215)

llm_graph_input_attn_kv::set_input and llm_graph_input_attn_kv_iswa::set_input
call set_input_k_rot / set_input_v_rot whenever the rotation tensor pointer is
non-null, but the tensor's buffer can be unallocated (NULL) when a graph only
stores K/V without attending -- e.g. DFlash speculative decoding's KV-injection
pass. set_input_k_rot then calls ggml_backend_buffer_is_host() on a NULL buffer
and aborts with GGML_ASSERT(buffer).

Guard the four k_rot/v_rot inputs with the same "&& ->buffer" check that the
adjacent kq_mask inputs already use in these two functions. When the buffer is
unallocated there is no data to upload, so skipping is correct.

Fixes #25191

Signed-off-by: liminfei-amd 91481003+liminfei-amd@users.noreply.github.com

macOS/iOS:

macOS Apple Silicon (arm64)
macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED
macOS Intel (x64)
iOS XCFramework

Linux:

Android:

Android arm64 (CPU)

Windows:

openEuler:

DISABLED
openEuler x86 (310p)
openEuler x86 (910b, ACL Graph)
openEuler aarch64 (310p)
openEuler aarch64 (910b, ACL Graph)

UI:

Assets 27

cudart-llama-bin-win-cuda-12.4-x64.zip

sha256:8c79a9b226de4b3cacfd1f83d24f962d0773be79f1e7b75c6af4ded7e32ae1d6

373 MB 2026-07-04T21:09:58Z
cudart-llama-bin-win-cuda-13.3-x64.zip

sha256:1462a050eb4c684921ba51dcc4cc488a036674c3e73e9945ee705b854808d03e

373 MB 2026-07-04T21:10:09Z
llama-b9873-bin-android-arm64.tar.gz

sha256:734635469840059d88554d9521be770fdb0fafeeb8319e61ee499e06993b830c

75.1 MB 2026-07-04T21:10:18Z
llama-b9873-bin-macos-arm64.tar.gz

sha256:2499e06ffa95f3bb852579b4b3b80f14a8079061b642c6bf2e6287f6b0a91038

10.6 MB 2026-07-04T21:10:20Z
llama-b9873-bin-macos-x64.tar.gz

sha256:724c1672e32a1c1491cd180e0b64a69c962f2047ac8d0c83b1f7394fe2e40d7d

10.9 MB 2026-07-04T21:10:21Z
llama-b9873-bin-ubuntu-arm64.tar.gz

sha256:75932910b85c12164c5c29b23b6855b17b68e36b6864837b1fa293a64204e814

12.3 MB 2026-07-04T21:10:22Z
llama-b9873-bin-ubuntu-openvino-2026.2.1-x64.tar.gz

sha256:08352eeecfa59df8c7288bf552d634d44895bbeb8860ff5ca4537024654bc450

96.6 MB 2026-07-04T21:10:23Z
llama-b9873-bin-ubuntu-rocm-7.2-x64.tar.gz

sha256:494762d88fb8463b07c66e37499177acbd3e7ef105fe6055a2665a6f934502ff

127 MB 2026-07-04T21:10:26Z
llama-b9873-bin-ubuntu-s390x.tar.gz

sha256:28135183b775c45e6d2f7ff70e71968ff797ca7f629e67d46e0aca537be2f4ff

14.1 MB 2026-07-04T21:10:31Z
llama-b9873-bin-ubuntu-sycl-fp16-x64.tar.gz

sha256:55bb336ba9b5c7e07d95b3d8b685c4533552caabdddeeb2780f106c1fffd9fa2

45.3 MB 2026-07-04T21:10:32Z
Source code (zip)

2026-07-04T20:37:38Z
Source code (tar.gz)

2026-07-04T20:37:38Z

b9871

github-actions released this 04 Jul 12:17

b9871

ef2d770

ggml : fix broken CPU concat implementation for quantized types (#25247)

ggml : fix broken CPU concat implementation for quantized types
tests : concat tests for quantized types

Co-authored-by: Stanisław Szymczyk sszymczy@gmail.com

macOS/iOS:

macOS Apple Silicon (arm64)
macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED
macOS Intel (x64)
iOS XCFramework

Linux:

Android:

Android arm64 (CPU)

Windows:

openEuler:

DISABLED
openEuler x86 (310p)
openEuler x86 (910b, ACL Graph)
openEuler aarch64 (310p)
openEuler aarch64 (910b, ACL Graph)

UI:

Assets 2

b9870

github-actions released this 03 Jul 21:38

b9870

2d97363

chat: trim messages sent to StepFun parser (fixes long reasoning loops) (#25238)

chat: trim messages sent to StepFun parser (fixes long reasoning loops)
add regression test; remove duplicate template
chat: trim StepFun content parts before rendering

The StepFun trim workaround ran on the already-rendered messages, where
typed content parts have been concatenated into a single string, so the
per-part whitespace could no longer be reached. Move the trim ahead of
rendering and apply it to content_parts text as well as the string
content and reasoning_content. Adds a content-parts regression test.

Co-Authored-By: Piotr Wilkin ilintar@gmail.com
Assisted-By: Claude Fable 5 noreply@anthropic.com

Co-authored-by: tarruda tpadilha84@gmail.com

macOS/iOS:

macOS Apple Silicon (arm64)
macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED
macOS Intel (x64)
iOS XCFramework

Linux:

Android:

Android arm64 (CPU)

Windows:

openEuler:

DISABLED
openEuler x86 (310p)
openEuler x86 (910b, ACL Graph)
openEuler aarch64 (310p)
openEuler aarch64 (910b, ACL Graph)

UI:

Assets 27

b9867

github-actions released this 03 Jul 14:56

b9867

152d337

spec: support spec-draft-p-min in DFlash (#25246)

spec: support spec-draft-p-min in DFlash
dflash: add n_min guard
dflash: guard both n_min and n_max

macOS/iOS:

macOS Apple Silicon (arm64)
macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED
macOS Intel (x64)
iOS XCFramework

Linux:

Android:

Android arm64 (CPU)

Windows:

openEuler:

DISABLED
openEuler x86 (310p)
openEuler x86 (910b, ACL Graph)
openEuler aarch64 (310p)
openEuler aarch64 (910b, ACL Graph)

UI:

Assets 27

b9866

github-actions released this 03 Jul 14:20

b9866

75a48a9

cuda: enable topk-moe fusion for 288 experts (#25267)

cuda: enable topk-moe fusion for 288 experts

The topk-moe fusion only accepted power-of-2 expert counts (or the
special-cased 576), so models with 288 experts (e.g. Step-3.7-Flash)
fell back to the unfused per-layer routing chain: softmax/sigmoid,
argsort, get_rows, sum_rows, div, clamp, scale. At batch size 1 that
is ~330 extra tiny graph nodes per token.

288 is a multiple of the warp size, so the existing kernel already
handles it; this adds the missing template instantiation and accepts
288 in the eligibility check.

Measured on gfx1151 with Step-3.7-Flash IQ4_XS (llama-bench,
-b 4096 -ub 4096 -fa 1 -dio 1 -ctk q8_0 -ctv q8_0; machine idle,
before/after paired so pp4096 stays matched as a load control):

test | before | after
----------------+----------------+----------------
pp4096 | 460.99 ± 0.45 | 462.47 ± 0.34 (unchanged)
tg128 | 19.10 ± 0.04 | 19.56 ± 0.03 (+2.4%)
tg128 @ d30000 | 12.68 ± 0.04 | 12.69 ± 0.03 (unchanged)

Prompt processing is unaffected (the fusion only touches decode
routing). The decode gain is ~+2.4% at shallow context and fades with
depth: by 30k tokens each step is attention-bound over the KV cache,
so removing the fixed routing overhead is no longer visible.

Assisted-By: Claude Fable 5 noreply@anthropic.com

Update tests/test-backend-ops.cpp

Co-authored-by: Oliver Simons osimons@nvidia.com

Add comment for case 288 in topk-moe.cu

Co-authored-by: Oliver Simons osimons@nvidia.com

macOS/iOS:

macOS Apple Silicon (arm64)
macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED
macOS Intel (x64)
iOS XCFramework

Linux:

Android:

Android arm64 (CPU)

Windows:

openEuler:

DISABLED
openEuler x86 (310p)
openEuler x86 (910b, ACL Graph)
openEuler aarch64 (310p)
openEuler aarch64 (910b, ACL Graph)

UI:

Assets 27

b9864

github-actions released this 03 Jul 11:28

b9864

b5315e1

server + ui: ping silent SSE streams every 1s and kick only after 3s so slow prefill never drops healthy connections (#25241)

server + ui: ping silent SSE streams every 1s and kick only after 3s so slow prefill never drops healthy connections
server + ui: sse_ping_interval becomes a per-request body field

Address review from ngxson: the global default returns to 30 so API
clients see no behavior change, and the WebUI sends sse_ping_interval: 1
in the request body since it owns the 3s visibility-kick contract and
declares the cadence it needs. Positive values keep the existing > 0
gate, -1 keeps its disabled semantics.

server: move sse_ping_interval into the request schema

Address review from ngxson: the field is now a typed field_num with
hard limits (-1, INT32_MAX) bound to task_params, seeded from the CLI
default alongside the other inherited parameters. The raw json_value
read and its redundant comment are gone, and schema evaluation brings
type and range validation for free.

macOS/iOS:

macOS Apple Silicon (arm64)
macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED
macOS Intel (x64)
iOS XCFramework

Linux:

Android:

Android arm64 (CPU)

Windows:

openEuler:

DISABLED
openEuler x86 (310p)
openEuler x86 (910b, ACL Graph)
openEuler aarch64 (310p)
openEuler aarch64 (910b, ACL Graph)

UI:

Assets 27

b9862

github-actions released this 03 Jul 09:41

b9862

5a460de

Remove redundant CUDA copies after gated_delta_net. (#23940)

Remove redundant CUDA copies after gated_delta_net.

Currently, GDN writes recurrent state snapshots into its output tail, then the graph immediately copies those snapshots into ssm_states_all. With MTP draft length 3, target decode uses K=4, so that becomes 4 extra ggml_cuda_cpy calls.

The change detects that gated_delta_net -> view -> cpy pattern and makes the CUDA GDN kernel write the state snapshot(s) directly into the recurrent cache, skipping the intermediate tail writes and copy kernels when safe.

Address review comments

macOS/iOS:

macOS Apple Silicon (arm64)
macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED
macOS Intel (x64)
iOS XCFramework

Linux:

Android:

Android arm64 (CPU)

Windows:

openEuler:

DISABLED
openEuler x86 (310p)
openEuler x86 (910b, ACL Graph)
openEuler aarch64 (310p)
openEuler aarch64 (910b, ACL Graph)

UI:

Assets 27

b9861

github-actions released this 03 Jul 09:05

b9861

c8ae9a7

vendor : update cpp-httplib to 0.49.0 (#25218)

macOS/iOS:

macOS Apple Silicon (arm64)
macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED
macOS Intel (x64)
iOS XCFramework

Linux:

Android:

Android arm64 (CPU)

Windows:

openEuler:

DISABLED
openEuler x86 (310p)
openEuler x86 (910b, ACL Graph)
openEuler aarch64 (310p)
openEuler aarch64 (910b, ACL Graph)

UI:

Assets 27

b9860

github-actions released this 02 Jul 16:23

b9860

fdb1db8

llama : add llama_model_ftype_name() (#25134)

llama : add llama_model_ftype_name()

Expose the model file type (quantization) name, e.g. "Q8_0" or
"Q4_K - Medium", through a new public C API. The returned pointer is
valid for the lifetime of the model and nullptr when the model is
invalid or the file type is unknown.

Signed-off-by: Adrien Gallouët angt@huggingface.co

Export enum

Signed-off-by: Adrien Gallouët angt@huggingface.co

s/llama_model_ftype_name/llama_ftype_name/

Signed-off-by: Adrien Gallouët angt@huggingface.co

Move "(guessed)" to the front in llama_ftype_name

Prepend the "(guessed)" label instead of appending it. This allows removing
the non-thread-safe static std::string, making the function allocation-free.

Signed-off-by: Adrien Gallouët angt@huggingface.co

Add LLAMA_FTYPE_PREFIX

Signed-off-by: Adrien Gallouët angt@huggingface.co

Dont check for model

Signed-off-by: Adrien Gallouët angt@huggingface.co

macOS/iOS:

macOS Apple Silicon (arm64)
macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED
macOS Intel (x64)
iOS XCFramework

Linux:

Android:

Android arm64 (CPU)

Windows:

openEuler:

DISABLED
openEuler x86 (310p)
openEuler x86 (910b, ACL Graph)
openEuler aarch64 (310p)
openEuler aarch64 (910b, ACL Graph)

UI:

Assets 27

b9859

github-actions released this 01 Jul 18:10

b9859

4fc4ec5

opencl: allow loading precompiled binary kernels from library (#23042)

opencl: allow loading binary kernel
opencl: add libdl.h
ggml-backend-dl is in ggml, which depends backend libs, thus
ggml-opencl cannot depend on ggml-backend-dl
add libdl.h to break cyclic dep
opencl: allow loading bin kernel lib
opencl: load gemm_moe_mxfp4_f32_ns from kernel lib if available
opencl: load q8_0 gemm from kernel lib
opencl: load q4_0 moe gemm from kernel lib
opencl: load q4_1 moe gemm from kernel lib
opencl: load q4_k moe gemm from kernel lib
opencl: always declare get_adreno_bin_kernel_func_t
opencl: rephrase message
opencl: fix for rebase
opencl: update doc

macOS/iOS:

macOS Apple Silicon (arm64)
macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED
macOS Intel (x64)
iOS XCFramework

Linux:

Android:

Android arm64 (CPU)

Windows:

openEuler:

DISABLED
openEuler x86 (310p)
openEuler x86 (910b, ACL Graph)
openEuler aarch64 (310p)
openEuler aarch64 (910b, ACL Graph)

UI:

Assets 27

Uh oh!

Releases: ggml-org/llama.cpp

Release list

b9873

Uh oh!

b9871

Uh oh!

b9870

Uh oh!

b9867

Uh oh!

b9866

Uh oh!

b9864

Uh oh!

b9862

Uh oh!

b9861

Uh oh!

b9860

Uh oh!

b9859

Uh oh!