Releases: ggml-org/llama.cpp

b9873

@github-actions github-actions released this 04 Jul 21:09
a410713

llama : add guard for K/V rotation input when buffer is unallocated (#25215)

llm_graph_input_attn_kv::set_input and llm_graph_input_attn_kv_iswa::set_input
call set_input_k_rot / set_input_v_rot whenever the rotation tensor pointer is
non-null, but the tensor's buffer can be unallocated (NULL) when a graph only
stores K/V without attending -- e.g. DFlash speculative decoding's KV-injection
pass. set_input_k_rot then calls ggml_backend_buffer_is_host() on a NULL buffer
and aborts with GGML_ASSERT(buffer).

Guard the four k_rot/v_rot inputs with the same "&& ->buffer" check that the
adjacent kq_mask inputs already use in these two functions. When the buffer is
unallocated there is no data to upload, so skipping is correct.
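
The guard described above can be sketched as follows. This is a minimal illustration with stand-in types, not the actual llama.cpp code: `mock_buffer`, `mock_tensor`, `upload_rotation`, and `set_rotation_input` are hypothetical names introduced only for this example.

```cpp
#include <cassert>

// Hypothetical stand-ins for the real ggml types; only what the sketch needs.
struct mock_buffer {};
struct mock_tensor { mock_buffer * buffer; };  // buffer == nullptr: unallocated

static int n_uploads = 0;  // counts how many times data would be uploaded

// Stand-in for set_input_k_rot / set_input_v_rot: requires an allocated buffer.
static void upload_rotation(const mock_tensor * t) {
    assert(t->buffer != nullptr);  // mirrors the GGML_ASSERT(buffer) abort
    ++n_uploads;
}

// Guarded call: skip when the tensor is absent or its buffer is unallocated,
// because in that case there is no data to upload.
static void set_rotation_input(const mock_tensor * rot) {
    if (rot && rot->buffer) {
        upload_rotation(rot);
    }
}
```

With this shape, a store-only graph whose rotation tensor has no buffer passes through `set_rotation_input` without reaching the assertion.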

Fixes #25191

Signed-off-by: liminfei-amd <91481003+liminfei-amd@users.noreply.github.com>

macOS/iOS:

Linux:

Android:

Windows:

openEuler:

  • DISABLED
  • openEuler x86 (310p)
  • openEuler x86 (910b, ACL Graph)
  • openEuler aarch64 (310p)
  • openEuler aarch64 (910b, ACL Graph)

UI:

b9871

@github-actions github-actions released this 04 Jul 12:17
ef2d770

ggml : fix broken CPU concat implementation for quantized types (#25247)

  • ggml : fix broken CPU concat implementation for quantized types

  • tests : concat tests for quantized types


Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>

b9870

@github-actions github-actions released this 03 Jul 21:38
2d97363

chat: trim messages sent to StepFun parser (fixes long reasoning loops) (#25238)

  • chat: trim messages sent to StepFun parser (fixes long reasoning loops)

  • add regression test; remove duplicate template

  • chat: trim StepFun content parts before rendering

The StepFun trim workaround ran on the already-rendered messages, where
typed content parts have been concatenated into a single string, so the
per-part whitespace could no longer be reached. Move the trim ahead of
rendering and apply it to content_parts text as well as the string
content and reasoning_content. Adds a content-parts regression test.
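
Why the order matters can be shown with a small sketch. The types and helpers below (`part_sketch`, `msg_sketch`, `trim_ws`, `trim_before_render`, `render_parts`) are hypothetical and only mimic the shape of a message with typed content parts; the real chat code is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Strip leading and trailing whitespace.
static std::string trim_ws(const std::string & s) {
    const char * ws = " \t\r\n";
    const std::size_t b = s.find_first_not_of(ws);
    if (b == std::string::npos) return "";
    const std::size_t e = s.find_last_not_of(ws);
    return s.substr(b, e - b + 1);
}

// Hypothetical message layout: string content plus typed content parts.
struct part_sketch { std::string type; std::string text; };
struct msg_sketch {
    std::string content;
    std::string reasoning_content;
    std::vector<part_sketch> content_parts;
};

// Trim every text-bearing field while the parts are still separate.
static void trim_before_render(msg_sketch & m) {
    m.content           = trim_ws(m.content);
    m.reasoning_content = trim_ws(m.reasoning_content);
    for (auto & p : m.content_parts) {
        if (p.type == "text") p.text = trim_ws(p.text);
    }
}

// Rendering concatenates the text parts into a single string.
static std::string render_parts(const msg_sketch & m) {
    std::string out;
    for (const auto & p : m.content_parts) {
        if (p.type == "text") out += p.text;
    }
    return out;
}

static void example_usage() {
    msg_sketch m;
    m.content_parts = { {"text", "  a \n"}, {"text", "\n b  "} };
    // Trimming after rendering leaves the inner per-part whitespace in place.
    assert(trim_ws(render_parts(m)) == "a \n\n b");
    // Trimming before rendering reaches each part.
    trim_before_render(m);
    assert(render_parts(m) == "ab");
}
```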

Co-Authored-By: Piotr Wilkin <ilintar@gmail.com>
Assisted-By: Claude Fable 5 <noreply@anthropic.com>


Co-authored-by: tarruda <tpadilha84@gmail.com>

b9867

@github-actions github-actions released this 03 Jul 14:56
152d337

spec: support spec-draft-p-min in DFlash (#25246)

  • spec: support spec-draft-p-min in DFlash

  • dflash: add n_min guard

  • dflash: guard both n_min and n_max

b9866

@github-actions github-actions released this 03 Jul 14:20
75a48a9

cuda: enable topk-moe fusion for 288 experts (#25267)

  • cuda: enable topk-moe fusion for 288 experts

The topk-moe fusion only accepted power-of-2 expert counts (or the
special-cased 576), so models with 288 experts (e.g. Step-3.7-Flash)
fell back to the unfused per-layer routing chain: softmax/sigmoid,
argsort, get_rows, sum_rows, div, clamp, scale. At batch size 1 that
is ~330 extra tiny graph nodes per token.

288 is a multiple of the warp size, so the existing kernel already
handles it; this adds the missing template instantiation and accepts
288 in the eligibility check.
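
As a rough sketch of the eligibility rule described above (power-of-2 counts, plus the special cases 576 and now 288): the function name and the warp size of 32 are assumptions made for this illustration, and the real check in the CUDA backend may apply further conditions that are not shown here.

```cpp
#include <cassert>

// Assumed warp width for this sketch.
constexpr int WARP_SIZE_SKETCH = 32;

// 288 = 9 * 32, so each lane handles a whole number of experts.
static_assert(288 % WARP_SIZE_SKETCH == 0, "288 is a multiple of the warp size");

static bool is_power_of_two(int n) {
    return n > 0 && (n & (n - 1)) == 0;
}

// Hypothetical predicate: which expert counts may take the fused path.
static bool topk_moe_fusion_accepts(int n_experts) {
    return is_power_of_two(n_experts) || n_experts == 576 || n_experts == 288;
}
```

Counts outside this set would keep using the unfused routing chain.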

Measured on gfx1151 with Step-3.7-Flash IQ4_XS (llama-bench,
-b 4096 -ub 4096 -fa 1 -dio 1 -ctk q8_0 -ctv q8_0; machine idle,
before/after paired so pp4096 stays matched as a load control):

test            | before (t/s)   | after (t/s)
----------------+----------------+---------------------------
pp4096          | 460.99 ± 0.45  | 462.47 ± 0.34 (unchanged)
tg128           | 19.10 ± 0.04   | 19.56 ± 0.03 (+2.4%)
tg128 @ d30000  | 12.68 ± 0.04   | 12.69 ± 0.03 (unchanged)

Prompt processing is unaffected (the fusion only touches decode
routing). The decode gain is ~+2.4% at shallow context and fades with
depth: by 30k tokens each step is attention-bound over the KV cache,
so removing the fixed routing overhead is no longer visible.
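
The percentages follow directly from the table means. A small helper (hypothetical, written only for this check) makes the arithmetic explicit:

```cpp
#include <cassert>

// Relative change of `after` versus `before`, in percent.
static double percent_change(double before, double after) {
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

Applied to the means, `percent_change(19.10, 19.56)` is about 2.4, while the pp4096 and tg128 @ d30000 rows come out at about 0.3 and 0.1, which the commit message treats as unchanged given the reported ± spreads.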

Assisted-By: Claude Fable 5 <noreply@anthropic.com>

  • Update tests/test-backend-ops.cpp

Co-authored-by: Oliver Simons <osimons@nvidia.com>

  • Add comment for case 288 in topk-moe.cu

Co-authored-by: Oliver Simons <osimons@nvidia.com>

b9864

@github-actions github-actions released this 03 Jul 11:28
b5315e1

server + ui: ping silent SSE streams every 1s and kick only after 3s so slow prefill never drops healthy connections (#25241)

  • server + ui: ping silent SSE streams every 1s and kick only after 3s so slow prefill never drops healthy connections

  • server + ui: sse_ping_interval becomes a per-request body field

Address review from ngxson: the global default returns to 30 so API
clients see no behavior change, and the WebUI sends sse_ping_interval: 1
in the request body since it owns the 3s visibility-kick contract and
declares the cadence it needs. Positive values keep the existing > 0
gate, -1 keeps its disabled semantics.
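
The semantics described above can be sketched as two small helpers. The names `resolve_ping_interval` and `should_send_ping` are hypothetical, and the values are assumed to be in seconds, matching the "every 1s" wording of the commit title; the server's actual code is not reproduced here.

```cpp
#include <cassert>

// Server-wide default, as stated in the commit message.
constexpr int SSE_PING_INTERVAL_DEFAULT = 30;

// A per-request value, when present in the body, overrides the default.
static int resolve_ping_interval(const int * from_body,
                                 int server_default = SSE_PING_INTERVAL_DEFAULT) {
    return from_body ? *from_body : server_default;
}

// Pings go out only when the interval is positive (the "> 0" gate) and the
// stream has been silent for at least that long. -1 never passes the gate.
static bool should_send_ping(int interval_s, int silent_s) {
    assert(silent_s >= 0);
    return interval_s > 0 && silent_s >= interval_s;
}
```

Under these assumptions, a WebUI request carrying `sse_ping_interval: 1` is pinged after one silent second, while an API client that omits the field keeps the 30-second default.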

  • server: move sse_ping_interval into the request schema

Address review from ngxson: the field is now a typed field_num with
hard limits (-1, INT32_MAX) bound to task_params, seeded from the CLI
default alongside the other inherited parameters. The raw json_value
read and its redundant comment are gone, and schema evaluation brings
type and range validation for free.
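
A typed numeric field with hard limits might look roughly like the sketch below. `num_field_sketch` and `make_sse_ping_interval_field` are hypothetical names, not the server's `field_num`, and the limits are assumed to be inclusive.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical typed field: a value plus inclusive hard limits.
struct num_field_sketch {
    int64_t min;
    int64_t max;
    int32_t value;

    // Store v and return true when it is in range; otherwise leave the
    // current value untouched and return false.
    bool set(int64_t v) {
        assert(min <= max);
        if (v < min || v > max) return false;
        value = static_cast<int32_t>(v);
        return true;
    }
};

// Limits (-1, INT32_MAX) from the commit message; seeded from the CLI default.
static num_field_sketch make_sse_ping_interval_field(int32_t cli_default) {
    return num_field_sketch{ -1, INT32_MAX, cli_default };
}
```

Because the range lives in the field definition, an out-of-range request value is rejected by validation instead of by ad hoc checks at the point of use.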

b9862

@github-actions github-actions released this 03 Jul 09:41
5a460de

Remove redundant CUDA copies after gated_delta_net. (#23940)

  • Remove redundant CUDA copies after gated_delta_net.

Currently, GDN writes recurrent state snapshots into its output tail, then the graph immediately copies those snapshots into ssm_states_all. With MTP draft length 3, target decode uses K=4, so that becomes 4 extra ggml_cuda_cpy calls.

The change detects that gated_delta_net -> view -> cpy pattern and makes the CUDA GDN kernel write the state snapshot(s) directly into the recurrent cache, skipping the intermediate tail writes and copy kernels when safe.
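
A minimal sketch of matching such a three-node pattern in a linearized graph is shown below. The node type, op enum, and function name are hypothetical stand-ins, and the sketch covers only the structural match; the "when safe" conditions of the real change are not modeled here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum mock_op { OP_GATED_DELTA_NET, OP_VIEW, OP_CPY, OP_OTHER };

// Hypothetical graph node: an op and its first input (nullptr if none).
struct mock_node {
    mock_op           op;
    const mock_node * src;
};

// True when nodes[i], nodes[i+1], nodes[i+2] form gated_delta_net -> view -> cpy
// and each node consumes the one before it.
static bool matches_gdn_view_cpy(const std::vector<mock_node> & nodes, std::size_t i) {
    if (i + 2 >= nodes.size()) return false;
    const mock_node & gdn  = nodes[i];
    const mock_node & view = nodes[i + 1];
    const mock_node & cpy  = nodes[i + 2];
    return gdn.op  == OP_GATED_DELTA_NET
        && view.op == OP_VIEW && view.src == &gdn
        && cpy.op  == OP_CPY  && cpy.src  == &view;
}

static void example_usage() {
    std::vector<mock_node> g(3);
    g[0] = { OP_GATED_DELTA_NET, nullptr };
    g[1] = { OP_VIEW, &g[0] };
    g[2] = { OP_CPY,  &g[1] };
    assert(matches_gdn_view_cpy(g, 0));

    g[1].op = OP_OTHER;  // break the chain: no longer a candidate
    assert(!matches_gdn_view_cpy(g, 0));
}
```

Once a match is found, a backend could hand the copy's destination to the kernel and skip the view and copy nodes.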

  • Address review comments

b9861

@github-actions github-actions released this 03 Jul 09:05
c8ae9a7

b9860

@github-actions github-actions released this 02 Jul 16:23
fdb1db8

llama : add llama_model_ftype_name() (#25134)

  • llama : add llama_model_ftype_name()

Expose the model file type (quantization) name, e.g. "Q8_0" or
"Q4_K - Medium", through a new public C API. The returned pointer is
valid for the lifetime of the model and nullptr when the model is
invalid or the file type is unknown.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

  • Export enum

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

  • s/llama_model_ftype_name/llama_ftype_name/

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

  • Move "(guessed)" to the front in llama_ftype_name

Prepend the "(guessed)" label instead of appending it. This allows removing
the non-thread-safe static std::string, making the function allocation-free.
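
One way a leading label can make such a function allocation-free is to store every name as a string literal that already carries the prefix, and to return a pointer past the prefix when the label is not wanted. The sketch below assumes this is the idea; the macro name, the integer file-type codes, and the function signature are hypothetical and are not taken from the llama.cpp sources.

```cpp
#include <cassert>
#include <cstring>

#define FTYPE_PREFIX_SKETCH "(guessed) "

// Returns a pointer into a string literal (static storage), so nothing is
// allocated and no shared mutable state is needed. Unknown codes give nullptr.
static const char * ftype_name_sketch(int ftype, bool guessed) {
    const char * full = nullptr;
    switch (ftype) {
        case 0:  full = FTYPE_PREFIX_SKETCH "Q8_0";          break;
        case 1:  full = FTYPE_PREFIX_SKETCH "Q4_K - Medium"; break;
        default: return nullptr;
    }
    // Skip the prefix (sizeof counts the terminating NUL, hence the -1).
    return guessed ? full : full + (sizeof(FTYPE_PREFIX_SKETCH) - 1);
}

static void example_usage() {
    assert(std::strcmp(ftype_name_sketch(0, false), "Q8_0") == 0);
    assert(std::strcmp(ftype_name_sketch(1, true), "(guessed) Q4_K - Medium") == 0);
    assert(ftype_name_sketch(99, false) == nullptr);
}
```

A trailing label would not allow this trick, because the unlabeled name would then need its own terminator in the middle of the literal.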

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

  • Add LLAMA_FTYPE_PREFIX

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

  • Don't check for model

Signed-off-by: Adrien Gallouët <angt@huggingface.co>


Signed-off-by: Adrien Gallouët <angt@huggingface.co>

b9859

@github-actions github-actions released this 01 Jul 18:10
4fc4ec5

opencl: allow loading precompiled binary kernels from library (#23042)

  • opencl: allow loading binary kernel

  • opencl: add libdl.h

  • ggml-backend-dl is in ggml, which depends on the backend libs, so
    ggml-opencl cannot depend on ggml-backend-dl

  • add libdl.h to break cyclic dep

  • opencl: allow loading bin kernel lib

  • opencl: load gemm_moe_mxfp4_f32_ns from kernel lib if available

  • opencl: load q8_0 gemm from kernel lib

  • opencl: load q4_0 moe gemm from kernel lib

  • opencl: load q4_1 moe gemm from kernel lib

  • opencl: load q4_k moe gemm from kernel lib

  • opencl: always declare get_adreno_bin_kernel_func_t

  • opencl: rephrase message

  • opencl: fix for rebase

  • opencl: update doc
