forked from ggml-org/llama.cpp
[pull] master from ggml-org:master #393
Open

pull wants to merge 543 commits into xvim:master from ggml-org:master (base: master)
Conversation
* llama/ggml: add LLM training support
  more compact progress bar
  llama_save_model_to_file
  llama_opt_param_filter
  ggml_graph_dup force_grads
  refactor ggml_opt, fix test-opt
* remove logits_all
* refactor CUDA implementation for ACC
* reset graph at beginning of opt period
* feat: Add GGUF conversion for granitemoeshared
* feat: hparam and arch plumbing for granitemoeshared
* fix: Split MoE fused tensors for shared experts in conversion
* feat: First WIP cut at model arch in cpp
  The hparam and architecture plumbing should be correct, but the implementation of the shared experts seems to still be broken.
* fix: Cleaner (maybe more correct?) splitting for gate/up
* fix: Fix the input to the shared experts
  I had misread that the shared experts take the inputs _before_ the standard MoE layer and was feeding the output of the MoE to the shared experts.
* fix: Avoid architecture-specific checks for Granite MoE Shared
  This is a cleaner way that will allow more flexibility in architecture strings going forward.
* refactor: Split granite architectures out of llm_build_llama
  This helps de-clutter the llama-family graph construction and allows granite to diverge further (in preparation for Granite 4). NOTE: I removed the granite scale factors from llm_build_deci because they appear to only be there as copy-paste from llm_build_llama. The HF config does not seem to set those values: https://huggingface.co/Deci/DeciLM-7B/blob/main/config.json
* fix: Fix compiler warning about uninitialized inp_pos
  This should not have been reachable, but it warns on some compilers.
* fix: Consolidate GraniteMoEShared into GraniteMoE for conversion
* fix: Consolidate GraniteMoEShared into GraniteMoE on the c++ side

All commits on Branch: GraniteMoEShared
---------
Signed-off-by: Gabe Goodhart <[email protected]>
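For context on the bug fix above: the shared experts and the router-selected experts both read the same layer input, and their outputs are summed; the bug was feeding the MoE output into the shared experts. A minimal sketch of that data flow, with placeholder functions (not the actual llm_build_granite graph code):

```cpp
// Sketch of the shared-expert MoE data flow described above.
// Both the routed experts and the shared experts see the SAME input x;
// their outputs are summed. The bug was feeding moe_out into the shared experts.
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

// placeholder experts: any function mapping a hidden state to a hidden state
static Vec routed_moe(const Vec & x)     { Vec y(x); for (auto & v : y) v *= 0.50f; return y; }
static Vec shared_experts(const Vec & x) { Vec y(x); for (auto & v : y) v *= 0.25f; return y; }

static Vec moe_block(const Vec & x) {
    const Vec moe_out    = routed_moe(x);      // sparse, router-selected experts
    const Vec shared_out = shared_experts(x);  // dense shared experts, input is x, NOT moe_out
    Vec out(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        out[i] = moe_out[i] + shared_out[i];   // combine both paths
    }
    return out;
}

int main() {
    const Vec y = moe_block({1.0f, 2.0f, 3.0f});
    return y.empty() ? 1 : 0;
}
```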
…#13460)
* mtmd : remove libllava, remove clip-quantize-cli
* rm clip_model_quantize
Signed-off-by: Dan Johansson <[email protected]>
* batched-bench : fix pp batch contents
* metal : optimize multi-sequence FA vec kernel ggml-ci

* batched-bench : fix pp batch contents
* metal : optimize multi-sequence FA vec kernel ggml-ci
* metal : use FA-vec kernel up to batch size 20 ggml-ci
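The "up to batch size 20" commit is a dispatch heuristic: small batches keep the FA vector kernel, larger ones use the regular kernel. A hedged illustration of that kind of selection (names are hypothetical, not the actual Metal backend code):

```cpp
// Illustrative dispatch between a FA "vec" kernel and a FA "matrix" kernel
// based on batch size, as described in the commit above. Names are hypothetical.
enum class fa_kernel { vec, mat };

constexpr int FA_VEC_MAX_BATCH = 20; // threshold mentioned in the commit message

static fa_kernel select_fa_kernel(int n_batch) {
    // the vec kernel wins for small batches; the matrix kernel amortizes
    // better once there are enough queries per step
    return n_batch <= FA_VEC_MAX_BATCH ? fa_kernel::vec : fa_kernel::mat;
}

int main() {
    return (select_fa_kernel(4) == fa_kernel::vec &&
            select_fa_kernel(64) == fa_kernel::mat) ? 0 : 1;
}
```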
* Update multimodal.md
  Minor change to include the huggingface link
* Update docs/multimodal.md
---------
Co-authored-by: Xuan-Son Nguyen <[email protected]>
* server: Allow pasting file from clipboard
* server: Prevent default action on file paste
* update build
* format then build combined
---------
Co-authored-by: Xuan Son Nguyen <[email protected]>
* webui : use pako for more deterministic gzip compress
* simpler code
* use fflate instead of pako
This shader uses coopmat1 to do the Q*K^T multiply. The P*V multiply is more difficult for various reasons so I haven't done it. Performance for this shader is around 2.5x better than for the scalar shader when doing prompt processing. Some of the benefit may be from other optimizations like staging through shared memory, or splitting by rows.
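For reference, attention involves two matrix products: S = Q·K^T (the part the coopmat1 shader accelerates), followed by a softmax into P, and then P·V (the part left as-is). A plain scalar reference of both stages, independent of the Vulkan shader:

```cpp
// Scalar reference for the two matmuls in attention mentioned above:
//   S = Q * K^T          (the multiply the coopmat1 shader accelerates)
//   P = softmax(S * 1/sqrt(d))
//   O = P * V            (the multiply still done the old way)
#include <algorithm>
#include <cmath>
#include <vector>

// row-major: Q is n_q x d, K and V are n_kv x d, output is n_q x d
static std::vector<float> attention_ref(const std::vector<float> & Q,
                                        const std::vector<float> & K,
                                        const std::vector<float> & V,
                                        int n_q, int n_kv, int d) {
    std::vector<float> S(n_q * n_kv), O(n_q * d, 0.0f);
    const float scale = 1.0f / std::sqrt((float) d);

    // stage 1: S = Q * K^T
    for (int i = 0; i < n_q; ++i) {
        for (int j = 0; j < n_kv; ++j) {
            float s = 0.0f;
            for (int k = 0; k < d; ++k) s += Q[i*d + k] * K[j*d + k];
            S[i*n_kv + j] = s * scale;
        }
    }

    // softmax over each row of S -> P (stored back into S)
    for (int i = 0; i < n_q; ++i) {
        float mx = S[i*n_kv];
        for (int j = 1; j < n_kv; ++j) mx = std::max(mx, S[i*n_kv + j]);
        float sum = 0.0f;
        for (int j = 0; j < n_kv; ++j) { S[i*n_kv + j] = std::exp(S[i*n_kv + j] - mx); sum += S[i*n_kv + j]; }
        for (int j = 0; j < n_kv; ++j) S[i*n_kv + j] /= sum;
    }

    // stage 2: O = P * V (the multiply the commit above has not mapped to coopmat)
    for (int i = 0; i < n_q; ++i)
        for (int j = 0; j < n_kv; ++j)
            for (int k = 0; k < d; ++k)
                O[i*d + k] += S[i*n_kv + j] * V[j*d + k];

    return O;
}

int main() {
    const auto O = attention_ref({1.0f, 0.0f}, {1.0f, 0.0f, 0.0f, 1.0f}, {1.0f, 2.0f, 3.0f, 4.0f}, 1, 2, 2);
    return O.size() == 2 ? 0 : 1;
}
```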
* server : passthrough the /models endpoint during loading
* server : update readme + return json for "meta" field
…nite (#13538)
This matches how others do it, but will still avoid the extra initialization when rope is disabled.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart <[email protected]>
… diffs) (#13933)
* server: update deepseek reasoning format (now in reasoning_content diffs), add legacy option for compat
* update unit/test_tool_call.py::test_thoughts
* gemma : fix attn scale for 27B
* cont : apply scale before attn
* cont : consistent attention scaling
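The "apply scale before attn" bullet refers to folding the attention scale into Q before the Q·K^T product rather than scaling the scores afterwards; the two placements are mathematically equivalent. A tiny illustration of that equivalence (not the actual Gemma graph code; the scale value here is only an example):

```cpp
// The two mathematically equivalent placements of the attention scale
// discussed above (illustrative only; not the actual llama.cpp Gemma code).
#include <cassert>
#include <cmath>

int main() {
    const float q = 0.8f, k = 1.5f;
    const float scale = 1.0f / std::sqrt(128.0f); // example value, e.g. 1/sqrt(head_dim)

    const float score_after  = (q * k) * scale;   // scale applied to the attention score
    const float score_before = (q * scale) * k;   // scale folded into Q before the matmul

    assert(std::fabs(score_after - score_before) < 1e-6f);
    return 0;
}
```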
* server : use swa-full for draft context ggml-ci
* server : disable speculative decoding for SWA models
* add concat, pad, repeat, tsembd, tanh, upscale
* small fixes
* This is not needed for normal use, where the result is read using `tensor_get`, but it allows the perf mode of `test-backend-ops` to properly measure performance.
* docs : add "Quick start" section for non-technical users * rm flox * Update README.md
* kv-cache : refactor update mechanism ggml-ci
* memory : improve status handling
* defrag : reset head + add comments ggml-ci
* cont : minor fixes ggml-ci
* ggml-vulkan: adds op CONV_TRANSPOSE_1D
* test-backend-ops: adds more sophisticated tests for CONV_TRANSPOSE_1D
* Missing barrier added to shader. Number of additional tests reduced to 108.
* Fixes typo in variable name.
* Removes extra whitespaces.
* Adds int64->int32 casts to prevent possible warnings.
* Problem size reduced in tests to pass tests with llvmpipe.
* supports_op condition moved from unintended position
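As a reference for what the new op computes, here is a minimal CPU implementation of a 1D transposed convolution (single channel, stride s, no padding); this is an independent sketch for sanity-checking, not the ggml-vulkan shader or the test-backend-ops code:

```cpp
// Minimal CPU reference for CONV_TRANSPOSE_1D (single channel, stride s, no padding),
// useful for sanity-checking a backend implementation such as the Vulkan op above.
#include <vector>

// input: length n, kernel: length k, output length: (n - 1) * s + k
static std::vector<float> conv_transpose_1d_ref(const std::vector<float> & input,
                                                const std::vector<float> & kernel,
                                                int s) {
    const int n = (int) input.size();
    const int k = (int) kernel.size();
    std::vector<float> out((n - 1) * s + k, 0.0f);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < k; ++j) {
            out[i*s + j] += input[i] * kernel[j]; // each input element spreads a scaled kernel
        }
    }
    return out;
}

int main() {
    // expected output: {1, 1, 3, 2, 2}
    const auto y = conv_transpose_1d_ref({1.0f, 2.0f}, {1.0f, 1.0f, 1.0f}, 2);
    return y.size() == 5 ? 0 : 1;
}
```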
…N_VER to llama.cpp sources (#14013)
…4006)
* memory : merge llama_kv_cache into llama_memory + new `llama_memory` API ggml-ci
* context : fix casts ggml-ci
Replace CMAKE_CUDA_ARCHITECTURES=native with nvidia-smi detection as 'native' fails on autodl cloud environments. Co-authored-by: pockers21 <[email protected]>
…#14001)
* allowing B580 and U9-288V
* experimenting code to detect Xe2
* allowing coopmat only for Xe2 GPUs
* fixed comment wording
* fixed comment wording
* removed unnecessary driver check
* add add_classifier_output_labels
* use add_classifier_output_labels
* llama : deprecate llama_kv_self_ API ggml-ci
* llama : allow llama_memory_(nullptr) ggml-ci
* memory : add flag for optional data clear in llama_memory_clear ggml-ci
* SYCL: Implement few same quantized type copy kernels
* Use memcpy for copying contiguous tensors ggml-ci
* feat(sycl): add contiguous tensor copy support and device checks
  Adds a memcpy path for contiguous tensors of the same type to optimize data transfer. Updates device support checks to recognize contiguous tensor operations, improving compatibility and performance.
* refactor: replace specific block copy functions with template
  The changes replace multiple redundant block copy functions (e.g., cpy_block_q8_0_q8_0, cpy_block_q5_0_q5_0) with a single templated function cpy_blck_q_q. This reduces code duplication by using a generic template that works for any block type, improving maintainability while preserving the same functionality. The template is instantiated with specific block types (e.g., block_q8_0) where needed.
* Exclude BF16 support for COPY tensors for now ggml-ci
* perf: adjust SYCL copy kernel block sizes for efficiency
  Use ceil_div to ensure full element coverage and update nd_range parameters to better align with SYCL block sizes, improving parallelism and device utilization in copy operations.
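The "replace specific block copy functions with template" refactor boils down to one copy routine parameterized on the block type instead of near-identical per-type functions. A simplified C++ sketch of the idea (the block structs here are stand-ins, not ggml's real definitions, and the actual SYCL kernels may copy fields rather than use memcpy):

```cpp
// Sketch of replacing per-type block copy functions (cpy_block_q8_0_q8_0,
// cpy_block_q5_0_q5_0, ...) with a single template, as described above.
// The block structs below are simplified stand-ins for illustration only.
#include <cstdint>
#include <cstring>

struct block_q8_0 { uint16_t d; int8_t  qs[32]; };
struct block_q5_0 { uint16_t d; uint8_t qh[4]; uint8_t qs[16]; };

// one templated same-type copy instead of N nearly identical functions
template <typename block_t>
static void cpy_blck_q_q(const void * src, void * dst) {
    // same source and destination block type -> a plain byte copy suffices
    std::memcpy(dst, src, sizeof(block_t));
}

int main() {
    block_q8_0 a{}, b{};
    cpy_blck_q_q<block_q8_0>(&a, &b); // instantiated per block type where needed
    block_q5_0 c{}, d{};
    cpy_blck_q_q<block_q5_0>(&c, &d);
    return 0;
}
```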
See Commits and Changes for more details.
Created by pull[bot] (v2.0.0-alpha.1)
Can you help keep this open source service alive? 💖 Please sponsor : )