sync : ggml#3652
Merged
* Add Q8_0 OpenCL kernel
* opencl: fix build for non-adreno
* opencl: refactor q8_0
* opencl: enforce subgroup size of 64 for adreno for q8_0. For the A750 and older generations, the subgroup size can be 64 or 128; this kernel assumes subgroup size 64.
* opencl: suppress warning when adreno kernels are disabled

Co-authored-by: yunjie <yunjie@qti.qualcomm.com>
Co-authored-by: Li He <lih@qti.qualcomm.com>
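For context, a minimal scalar sketch of the dot product a Q8_0 kernel computes, assuming ggml's usual `block_q8_0` layout (32 int8 quants plus a per-block scale; ggml stores the scale as fp16, simplified to `float` here). The actual OpenCL kernel vectorizes this across a subgroup of 64 lanes, so this is only a reference illustration:

```cpp
#include <cstddef>
#include <cstdint>

// Q8_0: blocks of 32 int8 quants with one per-block scale.
// ggml stores the scale as fp16; float is used here for simplicity.
struct block_q8_0 {
    float  d;       // per-block scale
    int8_t qs[32];  // quantized values
};

// Scalar reference for the Q8_0 x Q8_0 dot product; the OpenCL kernel
// distributes blocks across the 64 lanes of an Adreno subgroup.
static float dot_q8_0(const block_q8_0 *x, const block_q8_0 *y, size_t nblocks) {
    float sum = 0.0f;
    for (size_t i = 0; i < nblocks; ++i) {
        int32_t isum = 0;
        for (int j = 0; j < 32; ++j) {
            isum += (int32_t) x[i].qs[j] * (int32_t) y[i].qs[j];
        }
        sum += x[i].d * y[i].d * (float) isum;
    }
    return sum;
}
```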
* wip
* ggml-hexagon: add vectorized dot product function for FP32 and FP16 accumulation
* ggml-hexagon: optimize dot product functions for FP16 and FP32 with new vectorized implementations
* ggml-hexagon: optimize hvx_vec_dump_f32_n and hvx_vec_reduce_sum_qf32x2 functions for improved performance
* ggml-hexagon: refactor dot product functions to use a common loading function for improved readability
* optimize vector dot product functions to use unified reduction for improved performance
* hexagon: optimize reduce-sum for v75+
* hexagon: always keep row_sums in sf/fp32
* ggml-hexagon: enhance directory checks for HEXAGON_SDK_ROOT and HEXAGON_TOOLS_ROOT
* fix compile error after rebase

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
…ma/19188)

* Bump max cmake version (needed for Windows on Snapdragon builds)
* cmake: move max version setting into ggml/CMakeLists
* Remove mutex for pipeline caches, since they are now per-thread.
* Add comment
* Run clang-format
* Cleanup
* Run CI again
* Run CI once more
* Run clang-format
* Update old URLs to github.com/ggml-org/
* Bump copyrights
* metal : support virtual devices
* cont : manage buffer type context memory
* metal : add events
* cont : implement cpy_tensor_async
…idia & AMD GPU is unavailable: download/installation channels are out of work. (llama/19246)

Users can't build the software for Nvidia & AMD GPUs because the download/installation channels no longer work. Remove oneMath, since it is only used in the NV and AMD code paths.
* ggml-cpu: split across kv for faster TG
* simplify sinks application
* add ref impl
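Splitting one attention row across the KV dimension means each thread produces a partial online-softmax state that must be merged afterwards. A hedged sketch of that merge step (names and types are hypothetical, not the actual ggml-cpu code):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Partial state for one slice of KV positions: running max logit,
// sum of exp-weights, and the exp-weighted V accumulator.
struct KvPartial {
    float max_logit;
    float sum_exp;
    std::vector<float> acc; // one accumulator per head dimension
};

// Merge partials from two disjoint KV slices (standard online softmax):
// rescale both sides to a common max, then add sums and accumulators.
static KvPartial merge(const KvPartial &a, const KvPartial &b) {
    KvPartial out;
    out.max_logit = std::max(a.max_logit, b.max_logit);
    const float sa = std::exp(a.max_logit - out.max_logit);
    const float sb = std::exp(b.max_logit - out.max_logit);
    out.sum_exp = a.sum_exp * sa + b.sum_exp * sb;
    out.acc.resize(a.acc.size());
    for (size_t i = 0; i < a.acc.size(); ++i) {
        out.acc[i] = a.acc[i] * sa + b.acc[i] * sb;
    }
    return out;
}
```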
* opencl: refactor concat
* opencl: refactor repeat
* opencl: refactor tanh
* opencl: enable fp16 for tanh
* opencl: refactor scale
* opencl: fix unused variables
…lama/19227)

Hangs were reported on Jetson Orin AGX when CUDA_SCALE_LAUNCH_QUEUES=4x is set. Revert the previous PR (#19042) and update the documentation to suggest setting CUDA_SCALE_LAUNCH_QUEUES=4x for faster throughput on multi-GPU systems.
Add the missing cleanup calls for the IQ2_S and IQ1_M quantization types, and for IQ3XS with 512 blocks, to the quantization cleanup path.
…053)

By providing the stride_* variables as size_t (i.e., 64-bit), the compiler can correctly unroll the [two for-loops](https://github.com/ggml-org/llama.cpp/blob/557515be1e93ed8939dd8a7c7d08765fdbe8be31/ggml/src/ggml-cuda/mmq.cuh#L3789-L3816) on Blackwell (BW). This gives some extra performance in the prefill/pp phase on BW, while not affecting other SMs:

| GPU | Model | Test | t/s master | t/s osimons/fix_bw_mmq_fixup_kernel | Speedup |
|:---|:---|:---|---:|---:|---:|
| NVIDIA RTX 6000 Ada Generation | gpt-oss 20B MXFP4 MoE | pp8096 | 8404.05 | 8375.79 | 1.00 |
| NVIDIA RTX 6000 Ada Generation | llama 3B Q4_K_M | pp8096 | 16148.93 | 16019.60 | 0.99 |
| NVIDIA RTX 6000 Ada Generation | llama 8B Q4_0 | pp8096 | 8008.29 | 7978.80 | 1.00 |
| NVIDIA RTX 6000 Ada Generation | nemotron_h 9B BF16 | pp8096 | 4263.16 | 4248.53 | 1.00 |
| NVIDIA RTX 6000 Ada Generation | nemotron_h 9B Q4_K_M | pp8096 | 5165.11 | 5157.43 | 1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | gpt-oss 20B MXFP4 MoE | pp8096 | 12582.80 | 12758.37 | 1.01 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 3B Q4_K_M | pp8096 | 16879.10 | 17619.47 | 1.04 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | llama 8B Q4_0 | pp8096 | 10649.90 | 10982.65 | 1.03 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B BF16 | pp8096 | 7717.73 | 7716.22 | 1.00 |
| NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition | nemotron_h 9B Q4_K_M | pp8096 | 7301.90 | 7370.38 | 1.01 |
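A simplified illustration of why the stride width matters (not the actual mmq.cuh code): with 32-bit strides the index product can wrap modulo 2^32, so the compiler must preserve that behavior and materialize each offset per iteration, which can block unrolling; with `size_t` strides the offsets are already pointer-width and the loops unroll cleanly.

```cpp
#include <cstddef>

// 32-bit stride: i*stride is computed modulo 2^32, so the compiler has to
// keep the wrap semantics and recompute/widen the offset every iteration.
static void fixup_stride32(float *dst, const float *src, unsigned stride, int n) {
    for (int i = 0; i < n; ++i) {
        dst[i*stride] += src[i*stride];
    }
}

// 64-bit stride: offsets are pointer-width from the start, so the compiler
// can strength-reduce the addressing and unroll the loop freely.
static void fixup_stride64(float *dst, const float *src, size_t stride, int n) {
    for (size_t i = 0; i < (size_t) n; ++i) {
        dst[i*stride] += src[i*stride];
    }
}
```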
* CUDA: use mmvq for mul-mat-id for small batch sizes
* add mmvq too
* Fix perf issue on Ampere: use mmvf mm-id only for non-NVIDIA GPUs
* templatize multi_token_path
* ggml-cpu: use LUT for converting e8->f32 scales on x86
* add dispatch based on macro
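Assuming the usual E8M0 encoding (a biased power-of-two exponent, as used for MXFP4 scales), the conversion is a pure function of one byte, so a 256-entry table replaces per-scale bit manipulation with a single load. A hedged sketch (the actual kernel likely differs; edge encodings such as 0xFF/NaN are glossed over):

```cpp
#include <cmath>
#include <cstdint>

// E8M0 scale: value = 2^(e - 127) for exponent byte e.
// A 256-entry LUT makes the e8 -> f32 conversion a single indexed load.
static float e8m0_lut[256];

static void init_e8m0_lut(void) {
    for (int e = 0; e < 256; ++e) {
        // e = 255 is NaN in the OCP spec; left as +inf here for brevity.
        e8m0_lut[e] = std::ldexp(1.0f, e - 127);
    }
}

static inline float e8m0_to_f32(uint8_t e) {
    return e8m0_lut[e];
}
```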
* ggml-virtgpu: regenerate_remoting.py: add the ability to deprecate a function
* ggml-virtgpu: deprecate buffer_type is_host remoting (not necessary)
* ggml-virtgpu: stop using static vars as cache (the static init isn't thread-safe)
* ggml-virtgpu: protect the use of the shared memory used to transfer data
* ggml-virtgpu: make the remote calls thread-safe
* ggml-virtgpu: backend: don't continue if the tensor memory couldn't be allocated
* ggml-virtgpu: add a cleanup function for consistency
* ggml-virtgpu: backend: don't crash if buft->iface.get_max_size is missing
* fix style and ordering
* Remove the static variable in apir_device_get_count
* ggml-virtgpu: improve the logging
* fix minor formatting issues from review
* vulkan: fix GPU deduplication logic. As reported in ggml-org/llama.cpp#19221, the (same UUID, same driver) logic is problematic for Windows + Intel iGPU. Avoid filtering only for MoltenVK, which is Apple-specific, and otherwise keep the logic the same as before 88d23ad5: dedup based on UUID alone. Verified that macOS + 4x Vega still reports 4 GPUs with this version.
* vulkan: only skip dedup when both drivers are MoltenVK
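A simplified sketch of the resulting dedup rule (types and field names are hypothetical stand-ins for the backend's actual device records):

```cpp
#include <array>
#include <cstdint>

struct GpuInfo {
    std::array<uint8_t, 16> uuid; // VkPhysicalDeviceIDProperties::deviceUUID
    bool is_moltenvk;             // derived from the driver properties
};

// Dedup on UUID alone, except when BOTH devices come from MoltenVK:
// MoltenVK can report identical UUIDs for distinct GPUs (e.g. macOS with
// 4x Vega), so deduplication is skipped only in that case.
static bool is_duplicate(const GpuInfo &a, const GpuInfo &b) {
    if (a.is_moltenvk && b.is_moltenvk) {
        return false;
    }
    return a.uuid == b.uuid;
}
```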
…/19281)

Write out a 2-bit code per block and avoid loading the mask when it matches these two common cases. Apply this optimization when the mask is relatively large (i.e. during prompt processing).
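The two common cases are presumably blocks that are entirely 0 (the mask contributes nothing) and blocks that are entirely -inf (the whole KV block is masked out). A hypothetical sketch of such a 2-bit classification; the actual codes in the CUDA kernel may differ:

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical 2-bit tag per mask block; only MASK_MIXED requires actually
// loading the mask values during the attention pass.
enum mask_code : uint8_t {
    MASK_MIXED    = 0, // load the mask values as usual
    MASK_ALL_ZERO = 1, // skip the load: adds nothing to the logits
    MASK_ALL_NINF = 2, // skip the load and the whole KV block
};

static mask_code classify_block(const float *mask, int n) {
    bool all_zero = true;
    bool all_ninf = true;
    for (int i = 0; i < n; ++i) {
        all_zero = all_zero && (mask[i] == 0.0f);
        all_ninf = all_ninf && (mask[i] == -INFINITY);
    }
    if (all_zero) return MASK_ALL_ZERO;
    if (all_ninf) return MASK_ALL_NINF;
    return MASK_MIXED;
}
```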
* metal : skip loading all-zero mask
* cont : minor
* vulkan: make FA mask/softcap enables spec constants
* don't specialize for sinks
* bump timeout a little bit
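Host-side, Vulkan specialization constants let the driver fold the mask/softcap branches away at pipeline-compile time instead of branching at runtime. A hedged sketch of wiring them up (the constant IDs and struct are illustrative, not the backend's actual layout):

```cpp
#include <cstddef>
#include <cstdint>
#include <vulkan/vulkan.h>

// Feature enables passed to the FA shader as specialization constants.
struct FaSpecData {
    uint32_t has_mask;    // matches constant_id = 0 in the shader
    uint32_t has_softcap; // matches constant_id = 1 in the shader
};

static VkSpecializationInfo make_fa_spec(const FaSpecData *data) {
    static const VkSpecializationMapEntry entries[2] = {
        { 0, offsetof(FaSpecData, has_mask),    sizeof(uint32_t) },
        { 1, offsetof(FaSpecData, has_softcap), sizeof(uint32_t) },
    };
    VkSpecializationInfo info = {};
    info.mapEntryCount = 2;
    info.pMapEntries   = entries;
    info.dataSize      = sizeof(FaSpecData);
    info.pData         = data;
    return info; // passed via VkPipelineShaderStageCreateInfo::pSpecializationInfo
}
```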
…llama/19376)

The CPU and CUDA backends use fp16 for the VKQ accumulator type; this change does the same for Vulkan. It helps particularly with large head sizes, which are very register-limited. I tried this for the coopmat1 path and it slowed down a bit; I didn't try the scalar path. I applied the softmax bias that the CUDA backend uses to avoid overflow, although I was not able to reproduce the original bug without it.
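The overflow guard works because subtracting a constant bias inside the exp scales both the weighted-V accumulator and the softmax denominator by the same factor, which cancels in the final division; the only effect is extra headroom for an fp16 accumulator. A scalar sketch with an illustrative bias value (the real constant and kernel structure differ):

```cpp
#include <cmath>

static const float SOFTMAX_BIAS = 8.0f; // illustrative value only

// One output element: weights are exp(logit - max - BIAS), so every term
// is <= exp(-BIAS), leaving headroom when num/den are fp16 accumulators.
static float attend_row(const float *logits, const float *v, int n, float row_max) {
    float num = 0.0f; // fp16 ("half") in the actual kernels
    float den = 0.0f;
    for (int i = 0; i < n; ++i) {
        const float w = std::exp(logits[i] - row_max - SOFTMAX_BIAS);
        num += w * v[i];
        den += w;
    }
    return num / den; // the common exp(-SOFTMAX_BIAS) factor cancels here
}
```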
* Fix SYCL CEIL operator
* sycl: implement GGML_OP_CEIL
… (llama/19310)

* ggml webgpu: port binary operators to use pre-wgsl
* Add binary.wgsl: unified shader with conditionals for all 4 ops
* Add gen_binary_shaders.cpp: build tool for using the pre_wgsl preprocessor
* Remove bin_op.tmpl.wgsl and binary.wgsl (Python template)
* Update CMake to generate binary operator shaders at build time
* ggml-webgpu: migrate binary ops to JIT compilation with overlap handling
* port binary operators from AOT to pre-wgsl JIT compilation
* add src1 == dst overlap handling for binary ops
* use compile-time workgroup size defines instead of runtime overrides
* ggml-webgpu: complete overlap handling for binary ops
* add support for the inplace & overlap cases in binding setup
* restructure conditional logic to handle all overlap cases
* ensure all buffer bindings are correctly assigned for edge cases
* ggml-webgpu: remove unused binary overlap cases (remove the src0 == src1 overlap case, which never occurs in practice)
* keep INPLACE (src0 == dst), OVERLAP (src1 == dst), DEFAULT
* remove the unused src0 == src1 and all-same variants
* refactor the WGSL to eliminate duplication
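A hypothetical sketch of the three-way variant selection the list above describes; the actual backend chooses among JIT-compiled WGSL variants and sets up buffer bindings accordingly:

```cpp
// The three binary-op binding variants that remain after the cleanup.
enum class BinaryVariant {
    DEFAULT, // src0, src1, dst are all distinct buffers
    INPLACE, // src0 == dst: one read_write binding serves both
    OVERLAP, // src1 == dst: dst aliases the second operand
};

static BinaryVariant pick_variant(const void *src0, const void *src1, const void *dst) {
    if (src0 == dst) return BinaryVariant::INPLACE;
    if (src1 == dst) return BinaryVariant::OVERLAP;
    return BinaryVariant::DEFAULT;
}
```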
* metal : refactor bin kernels
* cont
* cont : fix cv
danbev approved these changes on Feb 7, 2026