
Releases: ngxson/llama.cpp

b6020

29 Jul 07:03
0a5036b
CUDA: add roll (#14919)

* CUDA: add roll

* Make everything const, use __restrict__
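
The new op follows the usual roll semantics: elements are shifted along a dimension and wrap around at the ends. A minimal sketch of that behaviour (assuming it matches numpy.roll; this is not the CUDA kernel itself):

```python
import numpy as np

def roll_1d(x, shift):
    """Shift elements by `shift` positions, wrapping around the ends."""
    n = len(x)
    shift %= n  # negative shifts wrap the other way
    # the last `shift` elements move to the front
    return np.concatenate([x[n - shift:], x[:n - shift]])

print(roll_1d(np.array([1, 2, 3, 4, 5]), 2))  # [4 5 1 2 3]
```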

b6018

28 Jul 16:57
bda6219
test-backend-ops : extend test case filtering (#14865)

* Extend test case filtering

1. Allow passing multiple comma-separated ops to test-backend-ops. This is convenient when working on a set of ops that you want to test together, without having to run every single op. For example:

`test-backend-ops.exe test -o "ADD,RMS_NORM,ROPE,SILU,SOFT_MAX"`

2. Support the full test-case variation string in addition to basic op names. This makes it easy to select a single variation, either for testing or for benchmarking, and is particularly useful for profiling a specific variation (e.g. a CUDA kernel). For example:

`test-backend-ops.exe perf -b CUDA0 -o "MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=2)"`

These two can be combined. As with the current `-o`, this change does not try to detect or report an error if a filter does not name an existing op (e.g. due to a misspelling).

* Updating the usage help text

* Update tests/test-backend-ops.cpp
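
A hypothetical sketch of the comma-separated filter matching described above; the name `matches_filter` and the parsing details are illustrative, not the actual test-backend-ops code. Commas inside a variation string's parentheses must not act as separators, so the split happens at paren depth zero only:

```python
def matches_filter(test_name: str, filter_arg: str) -> bool:
    """Accept either a bare op name ("ADD") or a full variation string
    ("MUL_MAT(type_a=f16,...)"). Split the filter on commas only at
    parenthesis depth zero, then match exactly against either form."""
    patterns, depth, cur = [], 0, []
    for ch in filter_arg:
        if ch == ',' and depth == 0:
            patterns.append(''.join(cur))
            cur = []
        else:
            depth += (ch == '(') - (ch == ')')
            cur.append(ch)
    patterns.append(''.join(cur))
    op_name = test_name.split('(')[0]  # bare op name of this test case
    return any(p == test_name or p == op_name for p in patterns)

print(matches_filter("ADD(type=f32,ne=[10])", "ADD,RMS_NORM,ROPE"))  # True
```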

b6017

28 Jul 16:59
c556418
llama-bench : use local GPUs along with RPC servers (#14917)

Currently, if RPC servers are specified with '--rpc' and a local GPU
(e.g. CUDA) is available, the benchmark is performed only on the RPC
device(s), yet the backend result column says "CUDA,RPC", which is
incorrect. This patch adds all local GPU devices and makes llama-bench
consistent with llama-cli.
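
The fix can be pictured as building one combined device list rather than letting the RPC devices replace the local ones. All names below are hypothetical, not llama.cpp's actual API:

```python
def build_device_list(local_gpus, rpc_servers):
    """Combine local GPU devices with RPC devices for benchmarking.
    Before the patch only the RPC devices were benchmarked, even though
    the backend column reported e.g. "CUDA,RPC"; after it, both are used."""
    return list(local_gpus) + [f"RPC[{s}]" for s in rpc_servers]

print(build_device_list(["CUDA0"], ["192.168.1.5:50052"]))
# ['CUDA0', 'RPC[192.168.1.5:50052]']
```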

b6016

28 Jul 16:11
db16e28
ggml-cpu : deduplicate scalar implementations (#14897)

* remove redundant code in riscv

* remove redundant code in arm

* remove redundant code in loongarch

* remove redundant code in ppc

* remove redundant code in s390

* remove redundant code in wasm

* remove redundant code in x86

* remove fallback headers

* fix x86 ggml_vec_dot_q8_0_q8_0
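
For reference, the scalar Q8_0 x Q8_0 dot product that these per-architecture files each duplicated (and that a shared fallback can now provide once) can be sketched as follows, assuming ggml's block layout of 32 int8 quants plus one scale d per block; the dot is the sum over blocks of d_a * d_b * sum(q_a[i] * q_b[i]):

```python
import numpy as np

QK8_0 = 32  # quants per block, as in ggml

def quantize_q8_0(x):
    """Quantize one block of 32 floats to (scale, int8 quants)."""
    d = np.abs(x).max() / 127.0
    q = np.zeros(QK8_0, np.int8) if d == 0 else np.round(x / d).astype(np.int8)
    return d, q

def vec_dot_q8_0_q8_0(blocks_a, blocks_b):
    """Scalar dot product over lists of (scale, quants) blocks."""
    return sum(da * db * int(np.dot(qa.astype(np.int32), qb.astype(np.int32)))
               for (da, qa), (db, qb) in zip(blocks_a, blocks_b))
```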

b6015

28 Jul 15:21
cd1fce6
SYCL: Add set_rows support for quantized types (#14883)

* SYCL: Add set_rows support for quantized types

This commit adds support for the GGML_OP_SET_ROWS operation for various
quantized tensor types (Q8_0, Q5_1, Q5_0, Q4_1, Q4_0, IQ4_NL) and the
BF16 type in the SYCL backend.

The quantization/dequantization copy kernels were moved from cpy.cpp
to cpy.hpp to make them available for set_rows.cpp.

This addresses part of the TODOs mentioned in the code.

* Use get_global_linear_id() instead

ggml-ci

* Fix formatting

ggml-ci

* Use const for ne11 and size_t variables in set_rows_sycl_q

ggml-ci

* Increase block size for q kernel to 256

ggml-ci

* Cleanup imports

* Add float.h to cpy.hpp
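
Sketching the semantics of GGML_OP_SET_ROWS with a quantized destination, using Q8_0 as the example format (the SYCL kernels handle the other listed types analogously; this is reference behaviour, not the kernel code): each f32 source row is quantized into the destination's block format at the row index given by the index tensor.

```python
import numpy as np

def quantize_row_q8_0(row):
    """Quantize a row to Q8_0: per-32-element blocks of (scale, int8[32])."""
    out = []
    for blk in row.reshape(-1, 32):
        d = np.abs(blk).max() / 127.0
        q = np.zeros(32, np.int8) if d == 0 else np.round(blk / d).astype(np.int8)
        out.append((d, q))
    return out

def set_rows(dst_rows, src, row_idx):
    """Write quantized copies of src's rows into dst_rows at row_idx."""
    for i, r in enumerate(row_idx):
        dst_rows[r] = quantize_row_q8_0(src[i])
```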

b6014

28 Jul 13:31
00fa15f
mtmd : add support for Voxtral (#14862)

* mtmd : add support for Voxtral

* clean up

* fix python requirements

* add [BEGIN_AUDIO] token

* also support Devstral conversion

* add docs and tests

* fix regression for ultravox

* minor coding style improvement

* correct project activation fn

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>

b6013

28 Jul 12:50
946b1f6
CUDA: fix pointer incrementation in FA (#14916)

b6012

28 Jul 12:08
6c6e397
model : add support for SmallThinker series (#14898)

* support smallthinker

* support 20b softmax, 4b no sliding window

* new build_moe_ffn_from_probs, and can run 4b

* fix 4b rope bug

* fix python type check

* remove is_moe judge

* remove set_dense_start_swa_pattern function and modify set_swa_pattern function

* trim trailing whitespace

* remove get_vocab_base of SmallThinkerModel in convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* better whitespace

Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* use GGML_ASSERT for expert count validation

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* Improve null pointer check for probs

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* use template parameter for SWA attention logic

* better whitespace

Co-authored-by: Georgi Gerganov <[email protected]>

* move the creation of inp_out_ids before the layer loop

* remove redundant judge for probs

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
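
The idea behind routing the MoE FFN from precomputed probabilities can be sketched as follows; the expert FFNs are stubbed as plain callables and the function name is illustrative, not the actual ggml graph-building code. The router picks the top-k experts per token and mixes their outputs by renormalized weights:

```python
import numpy as np

def moe_from_probs(x, probs, experts, k):
    """Mix the outputs of the k most probable experts for input x."""
    top = np.argsort(probs)[::-1][:k]      # indices of the k largest probs
    w = probs[top] / probs[top].sum()      # renormalize over selected experts
    return sum(wi * experts[ei](x) for wi, ei in zip(w, top))

# stub experts: each just scales its input
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
probs = np.array([0.1, 0.2, 0.6, 0.1])
y = moe_from_probs(np.ones(2), probs, experts, k=2)
```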

b6011

28 Jul 10:35
afc0e89
sycl: refactor quantization to q8_1 (#14815)

* sycl: quantization to q8_1 refactor

* Refactored src1 copy logic in op_mul_mat

b6002

27 Jul 10:17
89d1029
vulkan : add fp16 support for the conv_2d kernel (#14872)

* add f16 to conv_2d testing
* weaken conv2d test error threshold
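
A small illustration of why the f16 threshold needs to be looser: half precision has a 10-bit mantissa, so a long conv_2d reduction accumulates far more rounding error than the same sum in f32 (the numbers below come from numpy's float16, assumed comparable to the Vulkan kernel's arithmetic):

```python
import numpy as np

fp16_eps = np.finfo(np.float16).eps   # 2**-10 = 0.0009765625
fp32_eps = np.finfo(np.float32).eps   # 2**-23, about 1.19e-07

# A conv_2d output element is a long sum of products; summing 1024 terms
# in half precision shows visibly more rounding error than fp32 epsilon:
x = np.linspace(0.001, 1.0, 1024)
rel_err_f16 = abs(float(x.astype(np.float16).sum()) - x.sum()) / x.sum()
```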