Multi-backend improvements: SYCL, WebGPU, OpenCL, Hexagon, and CUDA enhancements by PazerOP · Pull Request #7 · wow-look-at-my/llama.cpp

PazerOP · 2026-06-07T07:50:00Z

Overview

This PR contains comprehensive improvements and bug fixes across multiple backend implementations:

SYCL Backend

Added mul_mat_vec_q_reorder_ncols template function for optimized multi-column matrix-vector quantized operations
Enhanced dequantization support with forward declarations for get_scale_min_k4
Improved vector dot product operations with additional template specializations

WebGPU Backend

Added flash attention configuration constants (GGML_WEBGPU_FLASH_ATTN_VEC_MAX_SEQ_LEN, GGML_WEBGPU_FLASH_ATTN_VEC_MAX_KV_TILE, GGML_WEBGPU_FLASH_ATTN_TILE_MAX_KV_TILE)
Refactored flash attention shader implementations with improved KV type handling
Added new shader templates for quantized flash attention staging

OpenCL Backend

Added new kernel implementations for Q5_0 and Q5_1 quantization formats (both flat and standard variants)
Added matrix multiplication kernels for Q5_0 and Q5_1 with L4 local memory optimization
Fixed uninitialized string variable in device context (opfilter_str)
Enhanced kernel registry with new quantization support

Hexagon Backend

Added F16 and F32 type support in get_x4x2_row_stride function
Fixed unsigned integer casting in dequantize_x4x2_weight_to_fp16_tiles_task_mxfp4
Refactored GDN context structure with improved VTCM memory management
Added HMX operations support with scattered mapping for MUL_MAT_ID
Added new utility headers for flash attention and power operations
Enhanced matmul context with matrix row mapping support

CUDA Backend

Improved flash attention kernel selection and configuration
Enhanced row reduction and quantization operations
Fixed memory alignment handling in buffer type operations

CPU Backend

Added RISC-V quantization optimizations
Added WebAssembly quantization support
Added KleidiAI integration with proper header includes

General Improvements

Added imatrix loader utility for calibration data handling
Enhanced model architecture support (added KIMI_LINEAR and MAINCODER architectures)
Improved chat template support with new model templates (LFM2.5-8B-A1B, Mellum)
Updated conversion scripts for better model support
Enhanced reasoning budget tracking
Improved security documentation

Additional information

This is a broad maintenance and feature PR that touches multiple backend implementations, with a focus on quantization support, flash attention optimization, and architecture compatibility.

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: NO

https://claude.ai/code/session_017bc4yHk5d8DDL9WEeQ7XHt

* docs zendnn added information about Q8 support
* docs zendnn rm unnecessary data
* docs update, links to ZenDNN docs provided
* docs zenDNN update: clarified explanation
* docs zenDNN update: one more explanation clarified

---------

Co-authored-by: plotnikov.v10 <plotnikov.v10@wb.ru>

* remove redundant apple job
  openvino gpu and cpu test can share the same build and machine
  Update build-rpc.yml
  Update build-openvino.yml
  cpu any doesn't make sense as we have an arm job already, so do high perf on both x86 and arm
  remove duplicate x86 vulkan
  combine backend sampling
  Update server.yml
  run server on arm as windows is x86
* emdawn on one machine only
* fix openvino, remove cpu tag as we don't have many x64 machines with that tag

) Fixes: ggml-org#23927 (comment) The cpu-x64-high-perf job was missing the Linux label in its runs-on specification, causing the runner to not be discovered. All other self-hosted Linux jobs include this label. Assisted-by: llama.cpp:local pi

…l-org#23056)

Q2_K/Q3_K/Q6_K do much better when using MMVQ on Intel BMG even though they're only 2-byte aligned, and Q3_K still wins on NVIDIA as well.

mesa isn't all that great at coalescing back-to-back loads from alternating arrays, so we force it instead. Further, we can do subtraction directly on a full int32_t rather than an i8vec4 with bit twiddling because the high bit is always free to start.

On Intel BMG on mesa, the switch to MMVQ provides an immediate ~57% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and ~78% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K. The further switch to block loads leads to a ~24% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q3_K and a ~48% perf increase in tg128 for unsloth/Qwen3.5-9B-GGUF:Q6_K.

Finally, Xe2 wins on MMVQ even for small k, so we take the NVIDIA override for K quants on Xe2 as well.
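One plausible reading of the "subtraction directly on a full int32_t" remark is sketched below in plain C++ rather than shader code. It is an illustration under stated assumptions, not the shader from the commit: four values are packed one per byte, each byte's high bit starts out clear, and the helper names `sub_bytes_reference` and `sub_bytes_packed` are hypothetical. Setting the free high bit before the subtraction stops borrows from crossing byte boundaries, and flipping it afterwards restores the two's-complement result in each byte.

```cpp
#include <cassert>
#include <cstdint>

// Reference: subtract `bias` from each of the four packed bytes separately,
// wrapping each result to 8 bits (two's complement).
inline uint32_t sub_bytes_reference(uint32_t packed, uint8_t bias) {
    uint32_t out = 0;
    for (int i = 0; i < 4; ++i) {
        const uint8_t v = (uint8_t) ((packed >> (8 * i)) & 0xFFu);
        const uint8_t r = (uint8_t) (v - bias);
        out |= (uint32_t) r << (8 * i);
    }
    return out;
}

// Packed variant: one 32-bit subtraction for all four bytes.
// Assumption: every byte of `packed` and `bias` itself are in [0, 127],
// i.e. the high bit of each byte is free.
inline uint32_t sub_bytes_packed(uint32_t packed, uint8_t bias) {
    assert((packed & 0x80808080u) == 0 && bias < 128);
    const uint32_t bias4 = 0x01010101u * bias;   // broadcast bias to all bytes
    // With the high bit set, each byte is >= 128 > bias, so no byte borrows
    // from its neighbour. The final XOR removes the 128 offset again (mod 256).
    return ((packed | 0x80808080u) - bias4) ^ 0x80808080u;
}
```

For example, the bytes 16, 32, 63, 0 with a bias of 32 become -16, 0, 31, -32 when each result byte is read as a signed 8-bit value.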

* Add EXAONE 4.5 and Add GQA for MMproj
* mtmd: EXAONE 4.5 vision markers and projector path

EXAONE 4.5 uses <vision> and </vision> for image boundaries; Qwen keeps <|vision_start|> and <|vision_end|>. Route EXAONE 4.5 through the Qwen2.5-VL-style encode path (window attention pattern, optional mmproj input norm). Update exaone4_5 projector weights and convert_hf_to_gguf for mmproj export.

* mtmd: load EXAONE4 nextn tensors correctly

Align EXAONE4 tensor registration with EXAONE_MOE for NextN/MTP slots and avoid skip-flag propagation on duplicated rope_freqs so model loading succeeds for EXAONE 4.5 GGUF.

* Minor fixes
* Address PR feedback
* Address PR feedback
* Fix EXAONE after merge
* Fix EXAONE 4.5 conversion
* Address PR feedback
* Refactor EXAONE 4.5 conversion
* Address PR feedback
* Fix unintended deletion
* Minor fix

---------

Co-authored-by: LG-AI-EXAONE <exaonemodels@lgresearch.ai>

…rg#23641)

* vulkan: don't hold the device mutex while compiling pipelines

We need to hold a lock while we traverse all pipelines and lazily initialize them, but we don't need to hold it while the pipeline is being compiled. And it doesn't need to be the same lock as the device mutex.

We call load_shaders each time a pipeline is needed, so we only need to compile that one pipeline (and, for example, don't want to end up compiling a pipeline that another thread should be compiling).

* remove 'needed'
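The locking idea in this commit message can be sketched in plain C++ as follows. This is a minimal illustration, not the Vulkan backend's code: `pipeline_cache`, `pipeline`, and `compile` are hypothetical names, and the "compilation" is a placeholder. The table lock is held only to look up and mark the pipeline, then released for the expensive compile, and a second thread asking for the same pipeline waits instead of compiling it again.

```cpp
#include <cassert>
#include <condition_variable>
#include <map>
#include <mutex>
#include <string>

// Hypothetical stand-in for a lazily compiled pipeline.
struct pipeline {
    bool compiling = false;
    bool compiled  = false;
    int  handle    = 0;
};

struct pipeline_cache {
    std::mutex mtx;                  // dedicated lock, not the device mutex
    std::condition_variable cv;
    std::map<std::string, pipeline> pipelines;
    int n_compiles = 0;

    // Placeholder for the expensive step; runs with no lock held.
    static int compile(const std::string & name) { return (int) name.size(); }

    int get(const std::string & name) {
        std::unique_lock<std::mutex> lock(mtx);
        pipeline & p = pipelines[name];  // std::map references survive later inserts
        if (p.compiled) {
            return p.handle;
        }
        if (p.compiling) {
            // Another thread owns this compile; wait for it instead of repeating it.
            cv.wait(lock, [&] { return p.compiled; });
            return p.handle;
        }
        p.compiling = true;
        lock.unlock();                   // do not hold the lock while compiling

        const int h = compile(name);

        lock.lock();
        assert(!p.compiled);             // only the owning thread publishes the result
        p.handle    = h;
        p.compiled  = true;
        p.compiling = false;
        ++n_compiles;
        lock.unlock();
        cv.notify_all();
        return h;
    }
};
```

Requesting the same name twice compiles it once; the second call returns the cached handle under the lock.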

Drops the hardcoded f32 GLU kernels in favor of a single template. We now load/store in the native tensor type (half or float) to save memory bandwidth, but keep the actual ALU compute in float to avoid exploding math in geglu/swiglu. Also opened up the dispatch gate to allow f16 inputs.
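The single-template approach described above can be sketched on the host side like this. It is only an illustration under assumptions: the real kernels run on a GPU backend with a native half type, whereas this sketch uses a hypothetical `bf16` struct (bfloat16-style truncation) as a stand-in 16-bit storage type, and the names `to_float`, `from_float`, and `swiglu` are chosen for the example. Values are loaded and stored in the tensor's own type, while the arithmetic is done in float.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Stand-in 16-bit storage type (upper 16 bits of an IEEE float); used only so
// the sketch compiles without a GPU half type.
struct bf16 {
    uint16_t bits;
};

inline float to_float(float v) { return v; }
inline float to_float(bf16 v) {
    const uint32_t u = (uint32_t) v.bits << 16;
    float f;
    std::memcpy(&f, &u, sizeof(f));
    return f;
}

template <typename T> T from_float(float f);
template <> inline float from_float<float>(float f) { return f; }
template <> inline bf16 from_float<bf16>(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof(u));
    return bf16{ (uint16_t) (u >> 16) };   // truncate the low mantissa bits
}

// One template for every storage type: load as T, compute in float, store as T.
template <typename T>
void swiglu(const std::vector<T> & x, const std::vector<T> & g, std::vector<T> & dst) {
    assert(x.size() == g.size());
    dst.resize(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        const float xf   = to_float(x[i]);
        const float gf   = to_float(g[i]);
        const float silu = xf / (1.0f + std::exp(-xf));   // float math throughout
        dst[i] = from_float<T>(silu * gf);
    }
}
```

Keeping the intermediate product in float means the 16-bit type only affects the rounding of inputs and of the final stored value.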

* nix: add nix-nodejs facilities to build Web UI

Build the Web UI locally using standard Nix systems for building NodeJS packages.

- Create derivation for the web UI
- npm dependencies are imported via buildNodeModules. Does not require setting any shasum.
- Copy build artifacts to the correct folders.
- Prevents having to download from huggingface.co

Fixes ggml-org#23067

* nix: simplify webui derivation using LLAMA_UI_OUT_DIR

- Move npm build to installPhase with LLAMA_UI_OUT_DIR=$out to write output directly to the Nix store
- Copy built assets to tools/ui/dist (source tree) instead of build/tools/ui/dist so CMake's copy_src_dist() finds them

…gml-org#23988)

* speculative : add common_speculative_n_max helper function

Extract the speculative max-draft-size logic from server_n_outputs_max into a reusable common_speculative_n_max() function in common/speculative.

Assisted-by: llama.cpp:local pi

* cont : draft context always has n_parallel outputs
* llama : log n_outputs_max
* speculative : remove draft-simple auto-enable
* ci : enable server tests on PRs

…rg#23971)

* server: real-time reasoning interruption via control endpoint

Builds on the manual reasoning budget trigger from ggml-org#23949. Adds a CONTROL task that mirrors the CANCEL path on the live slot and calls common_sampler_reasoning_budget_force to end thinking mid-generation. POST /v1/chat/completions/control with { id_slot, action }, opt-in reasoning_control arms the budget sampler on demand. Router and single model. Minimal WebUI button as a skeleton for further UI work.

* ui: track reasoning phase via explicit streaming state

Add isReasoning to the chat store, mirroring the isLoading pattern: per conversation map, private setter, public accessor and reactive export. Set from the stream callbacks, true on reasoning chunks, false on the first content chunk, reset on stream end and resynced on conversation switch. The skip button now keys off isReasoning so it shows only during the thinking phase, not the whole generation.

* ui: extract control endpoint and action into constants

Move the chat completion routes, the slots route and the reasoning control action out of chat.service into api-endpoints and a dedicated control-actions module. No behavior change, drops the magic strings so the control protocol has a single source of truth.

* server: target reasoning control by completion id

Address @ngxson review on the control endpoint. Switch from id_slot to the chat completion id to avoid a TOCTOU: the slot can be reassigned between the lookup and the control request, so matching the live completion (oaicompat_cmpl_id) is safe and a finished one simply matches nothing. Rename the action to reasoning_end, guard it on the reasoning_control flag of the target slot, and reduce the response to {success} with an optional message.

* ui: target reasoning control by completion id

Keep the streamed completion id on the message and post it back to the control endpoint instead of probing /slots. Drops the slot discovery and the TOCTOU that came with it. Action renamed to reasoning_end, response read as {success}.

* server: address review from @ngxson

Move the control fields into task_params and drop the redundant comments on the control path.

* server: document the reasoning control endpoint

* Update tools/ui/src/lib/types/database.d.ts

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* ui: rename cmplId to completionId

Per @allozaur review, clearer name for the streamed completion id.

* ui: wire completion id capture through the agentic flow

The webui streams through the agentic flow, which relayed onModel but not onCompletionId, so the completion id never reached the message and the control request was never sent. Relay it through the flow and its callbacks type, declare id on the chunk type, and log an explicit error when the button fires without a usable id.

* ui: target reasoning control model from the message

The model is a property of the completion, so read it from the streaming message like the id, not from the model dropdown which is unrelated UI state. Makes the request self-consistent by construction instead of just unlikely to drift.

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
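The switch from id_slot to the completion id can be illustrated with a small C++ sketch. Everything here is hypothetical apart from the ideas named in the commit message (matching the live completion id, guarding on the reasoning_control flag, and a {success} response with an optional message): the `slot` and `control_result` types and the `reasoning_end` function are simplified stand-ins, not the server's actual data structures.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-in for a server slot.
struct slot {
    int         id;
    std::string completion_id;      // id of the completion currently running here
    bool        reasoning_control;  // opt-in flag set by the request
    bool        reasoning_ended;
};

struct control_result {
    bool        success;
    std::string message;            // optional explanation on failure
};

// Match on the live completion id rather than on a slot index. A finished
// completion matches nothing, even if its old slot now serves another request.
control_result reasoning_end(std::vector<slot> & slots, const std::string & completion_id) {
    for (slot & s : slots) {
        if (s.completion_id.empty() || s.completion_id != completion_id) {
            continue;
        }
        if (!s.reasoning_control) {
            return { false, "reasoning_control is not enabled for this completion" };
        }
        s.reasoning_ended = true;
        return { true, "" };
    }
    return { false, "no live completion with this id" };
}

// Usage: the slot is reassigned between the client's lookup and its control
// request. Targeting by completion id leaves the new occupant untouched.
void example_reassigned_slot() {
    std::vector<slot> slots = { { 0, "chatcmpl-a", true, false } };
    slots[0].completion_id = "chatcmpl-b";          // slot 0 now serves another request
    const control_result r = reasoning_end(slots, "chatcmpl-a");
    assert(!r.success);
    assert(!slots[0].reasoning_ended);
}
```

Had the request carried a slot index instead, the same sequence would have ended reasoning for the unrelated completion now occupying slot 0.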

Uprev the private fork to ggml-org/llama.cpp master (106 upstream commits). Brings the fork up to date while preserving the Gemma 4 MTP work and the fork's CI resource-reduction changes.

Conflict resolutions:
- src/llama-graph.h: keep both the fork's MTP graph outputs (t_h_pre_norm, t_argmax) and upstream's new native nextn output (t_h_nextn); they are independent tensors used by different code paths.
- common/speculative.cpp: keep the fork's draft-model auto-enable/validation logic; restore the has_draft_model_path declaration the auto-merge dropped.
- .github/workflows/build-openvino.yml: keep the fork's hosted-CPU-only matrix (drop upstream's self-hosted GPU runner) to avoid self-hosted infra.
- .github/workflows/server.yml: keep upstream's backend-sampling test steps on the ubuntu job; keep the fork's removal of the Windows job (resource waste).

https://claude.ai/code/session_017bc4yHk5d8DDL9WEeQ7XHt

Upstream changed llama_hparams::n_layer from a field to a method and replaced the swa_layers array with is_swa_impl (plus the is_swa()/n_layer() accessors). The merge updated gemma4.cpp automatically, but the fork-only gemma4-assistant.cpp still used the old API and failed to compile. Mirror gemma4.cpp: n_layer -> n_layer(), swa_layers -> is_swa_impl, and use n_layer_all for n_layer_kv_from_start. Verified: CPU build of llama/llama-server/test-llama-archs succeeds and test-llama-archs passes with gemma4_assistant registered. https://claude.ai/code/session_017bc4yHk5d8DDL9WEeQ7XHt

CI builds with -DLLAMA_FATAL_WARNINGS=ON, which turns -Wswitch into an error. The fork's COMMON_SPECULATIVE_TYPE_GEMMA4_MTP enum value was not handled in the common_speculative_n_max switch (upstream code that the merge brought in), failing the build. GEMMA4_MTP uses the draft config like DRAFT_MTP, so group it with the existing draft cases. Verified with a local -DLLAMA_FATAL_WARNINGS=ON build of llama, llama-common, llama-server and test-llama-archs (294/294, no warnings). https://claude.ai/code/session_017bc4yHk5d8DDL9WEeQ7XHt
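The failure mode described above can be reproduced with a small sketch. The enum and function below are hypothetical, shortened stand-ins for COMMON_SPECULATIVE_TYPE_* and common_speculative_n_max; only the idea of grouping the new enumerator with the existing draft case comes from the commit message. Because the switch has no default label, compilers that enable -Wswitch report any enumerator without a case, and a warnings-as-errors build turns that report into a build failure.

```cpp
#include <cassert>

// Hypothetical, shortened stand-in for the speculative type enum.
enum speculative_type {
    SPEC_TYPE_NONE,
    SPEC_TYPE_DRAFT_MTP,
    SPEC_TYPE_GEMMA4_MTP,   // fork-only value that the merged switch did not handle
};

// Hypothetical stand-in for the draft configuration.
struct spec_config {
    int n_max_draft;
};

int speculative_n_max(speculative_type type, const spec_config & draft) {
    // No `default:` on purpose, so -Wswitch keeps flagging missing enumerators.
    switch (type) {
        case SPEC_TYPE_NONE:
            return 0;
        case SPEC_TYPE_DRAFT_MTP:
        case SPEC_TYPE_GEMMA4_MTP:   // uses the draft config, so it shares the draft case
            return draft.n_max_draft;
    }
    assert(false && "unhandled speculative_type");
    return 0;
}
```

Deleting the SPEC_TYPE_GEMMA4_MTP case label makes GCC and Clang warn that the enumeration value is not handled in the switch when -Wswitch is active, which is the situation the fix addresses.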

This is a CUDA-only Gemma 4 MTP fork; the upstream llama.cpp CI matrix builds a dozen backends/arches that are irrelevant here and just burn CI minutes and add flaky failures to every PR. Extend the fork's existing 'disable to prevent resource waste' pattern to all non-CUDA builds.

Disabled (renamed to .yml.disabled): build-3rd-party, snapdragon, android, cache, cmake-pkg, cpu, cross, msys, openvino, rpc, sanitize, virtgpu, vulkan, hip-quality-check.

Trimmed the non-CUDA sibling jobs (hip, musa) out of build-cuda-ubuntu.yml and (hip) out of build-cuda-windows.yml, leaving only the cuda job.

Kept: build-cuda-ubuntu/windows (cuda jobs), release/docker (manual), ui-build/ui-publish/ui, and the cheap lint/meta checks.

https://claude.ai/code/session_017bc4yHk5d8DDL9WEeQ7XHt

Per the CUDA-only policy: server.yml builds llama-server on an ARM CPU runner and runs the integration tests, and server-sanitize.yml is a CPU ASAN/UBSAN server build. Neither is CUDA, so disable both. The cuda job in build-cuda-ubuntu.yml plus the cheap lint/meta/docs/UI checks remain. https://claude.ai/code/session_017bc4yHk5d8DDL9WEeQ7XHt

wow-look-at-my/llama.cpp#7 merged the upstream uprev into master, so the laguna arch patch now applies cleanly against LLAMA_CPP_VERSION=master. GitHub does not fire CI on a dependency repo's merge, and this token lacks actions:write to re-run, so push an empty commit to re-trigger the patches/linux (test.yaml) and docker build (docker-build.yaml) workflows. https://claude.ai/code/session_017bc4yHk5d8DDL9WEeQ7XHt

The uprev'd llama.cpp (master, via wow-look-at-my/llama.cpp#7) turned llama_hparams::n_layer from a field into a method, so laguna.cpp's get_key_or_arr call passing hparams.n_layer failed to compile:

error: invalid use of non-static member function 'uint32_t llama_hparams::n_layer() const'

Call the accessor (hparams.n_layer()). is_swa_impl and the rest of the file already use the current API; this was the only stale usage in llama/compat.

Fixes the linux (CPU/CUDA) and docker build legs of #30.

https://claude.ai/code/session_017bc4yHk5d8DDL9WEeQ7XHt
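A minimal sketch of the field-to-method change and the resulting fix is shown below. The struct is a hypothetical stand-in, not llama_hparams: the private member `n_layer_value`, the setter, and the `layers_times_two` helper exist only for this example, and how upstream actually computes n_layer() is not stated in the commit message.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in: n_layer used to be a public field and is now an accessor.
struct hparams_sketch {
    uint32_t n_layer() const { return n_layer_value; }

    void set_n_layer(uint32_t n) {
        assert(n > 0);
        n_layer_value = n;
    }

private:
    uint32_t n_layer_value = 0;   // illustrative storage only
};

uint32_t layers_times_two(const hparams_sketch & hp) {
    // return 2 * hp.n_layer;    // stale usage: names the member function without calling it
    return 2 * hp.n_layer();     // fixed usage: call the accessor
}
```

The commented-out line is the same kind of stale usage that produced the "invalid use of non-static member function" error quoted above.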

0cc4m and others added 30 commits May 31, 2026 08:17

llama: only use one iGPU device by default (ggml-org#23897)

22cadc1

ui: fix ETag truncation with MSVC compiler (ggml-org#23917)

3292da0

vocab : add tokenizer support for jina-embeddings-v2-base-zh (ggml-or…

d4c8e2c

…g#18756) * vocab : add jina-embeddings-v2-base-zh (whitespace tokenizer) * lowercase defaults to true * type fix --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

ci : limit trigger paths for the CPU workflow (ggml-org#23938)

399739d

server : handle If-None-Match weak ETags (ggml-org#23916)

6f165c1

sycl : Optimize Q3_K mul_mat by reorder (ggml-org#23725)

44e211c

[SYCL] Add more types in GET_ROWS OP (ggml-org#23710)

4162522

* add to support Q1_0, NVFP4, IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ1_S, IQ1_M, IQ3_S, IQ4_NL, IQ4_XS, I32, MXFP4, Q2_K, Q3_K, Q5_K, and Q6_K in GET_ROWS OP * correct the link

[SYCL] Support Q4_1, Q5_0, Q5_1 in Flash-attention (ggml-org#23812)

a511424

* support Q4_1, Q5_0, Q5_1 * update ut case

common : support manually triggering the reasoning budget end sequence (ggml-org#23949)

5254a79

vulkan: Removed unused functions (ggml-org#23175)

f8c0a19

security : disable private disclosures (ggml-org#23963)

02a5701

TP: quantized KV cache support (ggml-org#23792)

8e6fff8

* TP: quantized KV cache support * fix partial view * remove overly strict assert

vocab: add normalizer.lowercase support to WPM (ggml-org#23899)

5aba536

* vocab : add jina-embeddings-v2-base-zh (whitespace tokenizer) * vocab : add normalizer.lowercase support to WPM * vocab : default normalizer.lowercase to false for whitespace pre-tokenizer

vulkan: reduce host memory lock contention (ggml-org#23376)

bef69f1

* vulkan: reduces lock contention * replace unique_lock with lock_guard

llama: limit max outputs of llama_context (ggml-org#23861)

de6f727

* llama: save more VRAM by reserving n_outputs == n_seqs when possible * add n_outputs_per_seq * move n_outputs_max to server-context * change ubatch to batch everywhere

vendor : update cpp-httplib to 0.46.1 (ggml-org#23980)

335abed

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

opencl: add basic support for q5_0 and q5_1 (ggml-org#23548)

27d9ed8

* opencl: add general q5_0 support * opencl: add general q5_1 support * opencl: support non-uniform workgrp size --------- Co-authored-by: Li He <lih@qti.qualcomm.com>

revert to using global_invocation_id for cpy shader (ggml-org#23955)

b8275a8

opencl: fix compiler warnings for non-adreno path (ggml-org#23922)

210a657

* opencl: fix compiler warnings for non-adreno path * opencl: fix const cast warning

clean up unused variables warnings (ggml-org#23975)

1fd5f48

ngxson and others added 4 commits June 6, 2026 21:17

mtmd: support "frame merge" for qwen-vl-based models (ggml-org#21858)

31e8249

* feat: add video support for Qwen3.5 * various clean up * revise the design * fix llava-uhd case * nits * nits 2 --------- Co-authored-by: andrewmd5 <1297077+andrewmd5@users.noreply.github.com>

common/chat : fix LFM2/LFM2.5 reasoning round-trip and <think> leak (g…

98d5e8b

…gml-org#24234) * common/chat : fix LFM2 reasoning round-trip and stray <think> leak * Gate by reasoning format and whether the template supports <think>

pr-minder Bot added the auto-pr-update label Jun 7, 2026

github-actions Bot added the documentation, Apple Metal, SYCL, Nvidia GPU, Vulkan, testing, examples, devops, python, script, server/ui, server, ggml, model, OpenCL, Hexagon, WebGPU, and nix labels Jun 7, 2026

claude added 3 commits June 7, 2026 08:02

PazerOP merged commit 48d22ce into master Jun 7, 2026
13 checks passed

PazerOP deleted the claude/lucid-shannon-c12Gl branch June 7, 2026 08:53

PazerOP merged 111 commits into master from claude/lucid-shannon-c12Gl

20 participants
