Releases · ngxson/llama.cpp
b6020
b6018
test-backend-ops : extend test case filtering (#14865)

* Extend test case filtering
  1. Allow passing multiple comma-separated ops to test-backend-ops. This is convenient when working on a set of ops that you want to test together, without having to run every single op. For example: `test-backend-ops.exe test -o "ADD,RMS_NORM,ROPE,SILU,SOFT_MAX"`
  2. Support the full test-case variation string in addition to basic op names. This makes it easy to select a single variation, either for testing or for benchmarking, and is particularly useful for profiling one variation (e.g. a CUDA kernel): `test-backend-ops.exe perf -b CUDA0 -o "MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=2)"`

  The two can be combined. As with the current `-o`, this change does not try to detect or report an error if a filter does not name an existing op (e.g. a misspelled name).
* Update the usage help text
* Update tests/test-backend-ops.cpp
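Note that a naive split on `,` would break full variation strings, which themselves contain commas. A minimal C++ sketch of a parenthesis-aware filter parser and matcher (the helper names and the exact-match rule are illustrative assumptions, not the PR's actual code):

```cpp
#include <string>
#include <vector>

// Split a comma-separated filter, ignoring commas inside parentheses so full
// variation strings like "MUL_MAT(type_a=f16,type_b=f32,...)" stay intact.
static std::vector<std::string> split_filter(const std::string & filter) {
    std::vector<std::string> patterns;
    std::string cur;
    int depth = 0;
    for (char c : filter) {
        if (c == '(') depth++;
        if (c == ')') depth--;
        if (c == ',' && depth == 0) {
            if (!cur.empty()) patterns.push_back(cur);
            cur.clear();
        } else {
            cur += c;
        }
    }
    if (!cur.empty()) patterns.push_back(cur);
    return patterns;
}

static bool filter_matches(const std::vector<std::string> & patterns,
                           const std::string & op_name,
                           const std::string & variation) {
    for (const std::string & p : patterns) {
        // a pattern containing '(' selects one exact variation string,
        // a bare name selects every variation of that op
        const bool match = p.find('(') != std::string::npos ? p == variation
                                                            : p == op_name;
        if (match) return true;
    }
    return false;
}
```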
b6017
llama-bench : use local GPUs along with RPC servers (#14917)

Currently, if RPC servers are specified with '--rpc' and a local GPU is available (e.g. CUDA), the benchmark is performed only on the RPC device(s), yet the backend result column says "CUDA,RPC", which is incorrect. This patch adds all local GPU devices as well, making llama-bench consistent with llama-cli.
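A rough sketch of the fixed device-list assembly, with hypothetical enumeration stubs standing in for the real ggml backend queries:

```cpp
#include <string>
#include <vector>

struct device_entry { std::string name; };

// Hypothetical stubs standing in for the real backend enumeration.
static std::vector<device_entry> enumerate_local_gpus() {
    return { {"CUDA0"} };  // placeholder result
}
static std::vector<device_entry> enumerate_rpc_devices(const std::string & rpc_servers) {
    if (rpc_servers.empty()) return {};
    return { {"RPC[" + rpc_servers + "]"} };  // placeholder result
}

// Before the fix: with --rpc set, only RPC devices were benchmarked even
// though the backend column reported "CUDA,RPC". After the fix: local GPUs
// are included alongside the RPC devices, matching llama-cli.
static std::vector<device_entry> build_device_list(const std::string & rpc_servers) {
    std::vector<device_entry> devices = enumerate_local_gpus();
    const std::vector<device_entry> rpc = enumerate_rpc_devices(rpc_servers);
    devices.insert(devices.end(), rpc.begin(), rpc.end());
    return devices;
}
```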
b6016
ggml-cpu : deduplicate scalar implementations (#14897)

* remove redundant code in riscv
* remove redundant code in arm
* remove redundant code in loongarch
* remove redundant code in ppc
* remove redundant code in s390
* remove redundant code in wasm
* remove redundant code in x86
* remove fallback headers
* fix x86 ggml_vec_dot_q8_0_q8_0
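For context, a minimal scalar sketch of the kind of fallback being deduplicated, assuming the standard q8_0 layout of 32 int8 quants plus a scale (stored as fp16 in llama.cpp, simplified to float here). One portable loop like this can back every architecture instead of a near-identical copy per SIMD port:

```cpp
#include <cstddef>
#include <cstdint>

// Simplified q8_0 block: 32 signed 8-bit quants and one scale.
constexpr int QK8_0 = 32;
struct block_q8_0 {
    float  d;          // scale (fp16 in llama.cpp; float keeps the sketch simple)
    int8_t qs[QK8_0];  // quants
};

// Scalar dot product between two q8_0 rows of n elements (n % QK8_0 == 0):
// per block, accumulate the int8 products, then apply both scales.
float vec_dot_q8_0_q8_0_scalar(size_t n, const block_q8_0 * x, const block_q8_0 * y) {
    float sum = 0.0f;
    for (size_t ib = 0; ib < n / QK8_0; ++ib) {
        int32_t acc = 0;
        for (int i = 0; i < QK8_0; ++i) {
            acc += (int32_t) x[ib].qs[i] * (int32_t) y[ib].qs[i];
        }
        sum += (float) acc * x[ib].d * y[ib].d;
    }
    return sum;
}
```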
b6015
SYCL: Add set_rows support for quantized types (#14883)

* SYCL: Add set_rows support for quantized types

  This commit adds support for the GGML_OP_SET_ROWS operation for several quantized tensor types (Q8_0, Q5_1, Q5_0, Q4_1, Q4_0, IQ4_NL) and for BF16 in the SYCL backend. The quantization/dequantization copy kernels were moved from cpy.cpp to cpy.hpp to make them available to set_rows.cpp. This addresses part of the TODOs mentioned in the code.
* Use get_global_linear_id() instead
* Fix formatting
* Use const for ne11 and size_t variables in set_rows_sycl_q
* Increase block size for q kernel to 256
* Cleanup imports
* Add float.h to cpy.hpp
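Conceptually, SET_ROWS scatters source rows into a destination tensor at positions given by an index tensor, quantizing on the fly when the destination is a quantized type. A plain C++ sketch of that semantics for a q8_0 destination (block layout and rounding are simplified assumptions; the real SYCL kernel parallelizes over rows):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

constexpr int QK8_0 = 32;
struct block_q8_0 {
    float  d;          // scale (fp16 in llama.cpp; float here for simplicity)
    int8_t qs[QK8_0];  // quants
};

// Quantize one row of n_cols floats (n_cols % QK8_0 == 0) into q8_0 blocks.
static void quantize_row_q8_0(const float * src, block_q8_0 * dst, int n_cols) {
    for (int ib = 0; ib < n_cols / QK8_0; ++ib) {
        float amax = 0.0f;
        for (int i = 0; i < QK8_0; ++i) {
            amax = std::max(amax, std::fabs(src[ib*QK8_0 + i]));
        }
        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        dst[ib].d = d;
        for (int i = 0; i < QK8_0; ++i) {
            dst[ib].qs[i] = (int8_t) std::lround(src[ib*QK8_0 + i] * id);
        }
    }
}

// SET_ROWS: for each source row r, write it to destination row rows[r],
// quantizing as it is copied.
void set_rows_q8_0(const float * src, const int64_t * rows, int n_rows,
                   int n_cols, block_q8_0 * dst) {
    const int blocks_per_row = n_cols / QK8_0;
    for (int r = 0; r < n_rows; ++r) {
        quantize_row_q8_0(src + (int64_t) r * n_cols,
                          dst + rows[r] * blocks_per_row, n_cols);
    }
}
```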
b6014
mtmd : add support for Voxtral (#14862)

* mtmd : add support for Voxtral
* clean up
* fix python requirements
* add [BEGIN_AUDIO] token
* also support Devstral conversion
* add docs and tests
* fix regression for ultravox
* minor coding style improvement
* correct project activation fn
* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <[email protected]>
b6013
CUDA: fix pointer incrementation in FA (#14916)
b6012
model : add support for SmallThinker series (#14898)

* support smallthinker
* support 20b softmax, 4b no sliding window
* new build_moe_ffn_from_probs, and can run 4b
* fix 4b rope bug
* fix python type check
* remove is_moe judge
* remove set_dense_start_swa_pattern function and modify set_swa_pattern function
* trim trailing whitespace
* remove get_vocab_base of SmallThinkerModel in convert_hf_to_gguf.py
* better whitespace (apply suggestions from code review)
* use GGML_ASSERT for expert count validation
* Improve null pointer check for probs
* use template parameter for SWA attention logic
* better whitespace
* move the creation of inp_out_ids before the layer loop
* remove redundant judge for probs

Co-authored-by: Sigbjørn Skjæret <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
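A rough sketch of the idea behind building a MoE FFN from precomputed router probabilities: pick the top-k experts per token and mix their outputs by renormalized probability (function shape and names are illustrative assumptions, not the PR's graph code):

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// probs:         n_expert router probabilities for one token
// n_expert_used: how many experts to activate (top-k)
// expert_out(e): hypothetical callback computing expert e's FFN output
std::vector<float> moe_ffn_from_probs(
        const std::vector<float> & probs, int n_expert_used,
        const std::function<std::vector<float>(int)> & expert_out) {
    const int n_expert = (int) probs.size();
    assert(n_expert > 0 && n_expert_used <= n_expert);  // cf. GGML_ASSERT in the PR

    // indices of the k highest probabilities
    std::vector<int> idx(n_expert);
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + n_expert_used, idx.end(),
                      [&](int a, int b) { return probs[a] > probs[b]; });

    // renormalize the selected probabilities so the weights sum to 1
    float wsum = 0.0f;
    for (int i = 0; i < n_expert_used; ++i) wsum += probs[idx[i]];

    // weighted mix of the selected experts' outputs
    std::vector<float> out;
    for (int i = 0; i < n_expert_used; ++i) {
        const std::vector<float> eo = expert_out(idx[i]);
        if (out.empty()) out.assign(eo.size(), 0.0f);
        const float w = probs[idx[i]] / wsum;
        for (size_t j = 0; j < eo.size(); ++j) out[j] += w * eo[j];
    }
    return out;
}
```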
b6011
sycl: refactor quantization to q8_1 (#14815)

* sycl: quantization to q8_1 refactor
* Refactored src1 copy logic in op_mul_mat
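For reference, a scalar sketch of q8_1 quantization itself, assuming the usual 32-value block with scale d = amax/127 and a cached sum term s = d * sum(q_i) stored alongside the quants (both held as fp16 in the real format, float here):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

constexpr int QK8_1 = 32;
struct block_q8_1 {
    float  d;          // scale (stored as fp16 in llama.cpp)
    float  s;          // d * sum(qs), cached to speed up dot products
    int8_t qs[QK8_1];  // quants
};

// Quantize QK8_1 floats into one q8_1 block.
void quantize_block_q8_1(const float * x, block_q8_1 * b) {
    float amax = 0.0f;
    for (int i = 0; i < QK8_1; ++i) {
        amax = std::max(amax, std::fabs(x[i]));
    }
    const float d  = amax / 127.0f;
    const float id = d != 0.0f ? 1.0f / d : 0.0f;

    int32_t qsum = 0;
    for (int i = 0; i < QK8_1; ++i) {
        const int8_t q = (int8_t) std::lround(x[i] * id);
        b->qs[i] = q;
        qsum += q;
    }
    b->d = d;
    b->s = d * (float) qsum;  // the sum term that q4_1/q5_1-style dots consume
}
```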
b6002
vulkan : add fp16 support for the conv_2d kernel (#14872)

* add f16 to conv_2d testing
* weaken conv2d test error threshold
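The threshold change is consistent with a normalized error comparison against a reference implementation: fp16 accumulation loses precision, so an f16 run needs a looser bound than an f32 run. A sketch assuming an NMSE-style metric (the exact metric and threshold values in test-backend-ops may differ):

```cpp
#include <cstddef>

// Normalized mean squared error between a backend result and a reference.
double nmse(const float * ref, const float * out, size_t n) {
    double num = 0.0, den = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double d = (double) ref[i] - (double) out[i];
        num += d * d;
        den += (double) ref[i] * (double) ref[i];
    }
    return den > 0.0 ? num / den : 0.0;
}

// Hypothetical thresholds for illustration only: the f16 bound is weaker
// because conv_2d accumulation in half precision drifts further from the
// f32 reference.
bool conv2d_ok_f32(double e) { return e < 1e-7; }
bool conv2d_ok_f16(double e) { return e < 1e-3; }
```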