Releases · ggml-org/llama.cpp

Release list

b9832

github-actions released this 28 Jun 14:18

b9832

f68a788

jinja: add --dump-prog for debugging (#25086)

jinja: add --dump-prog for debugging
Update common/jinja/runtime.cpp

Co-authored-by: Sigbjørn Skjæret <1629204+CISC@users.noreply.github.com>

macOS/iOS:

macOS Apple Silicon (arm64)
macOS Apple Silicon (arm64, KleidiAI enabled) DISABLED
macOS Intel (x64)
iOS XCFramework

Linux:

Android:

Android arm64 (CPU)

Windows:

openEuler:

DISABLED
openEuler x86 (310p)
openEuler x86 (910b, ACL Graph)
openEuler aarch64 (310p)
openEuler aarch64 (910b, ACL Graph)

UI:

b9831

github-actions released this 28 Jun 13:38

b9831

d1b3425

spec : add DFlash support (#22105)

spec: add DFlash v2 support
dflash: support sliding window attention per layer_types
docs: add dflash section

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

b9830

github-actions released this 28 Jun 11:03

b9830

c1a1c8e

common : allow --offline in llama download (#25091)

Expose the existing --offline flag to llama download so a script can
run it to check whether a model is already cached and ready to be served
without touching the network.

Also fix a latent use-after-free in the URL-task on_done callback:
first_path is block-scoped and was captured by reference, but invoked
after the block ends.
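
The lifetime issue described above can be illustrated with a minimal sketch. The `task` struct, the `make_task_safe` function, and the path string are hypothetical stand-ins chosen for illustration, not the actual llama.cpp types or the actual fix; the sketch only shows why a callback that outlives a block should capture a block-scoped string by value.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical stand-in for a deferred task; not the llama.cpp type.
struct task {
    std::function<std::string()> on_done;
};

// Builds a task whose callback runs only after the block that created
// `first_path` has ended. Capturing by value copies the string into the
// closure, so the callback owns its own data.
inline task make_task_safe() {
    task t;
    {
        std::string first_path = "models/example.gguf"; // block-scoped
        // Buggy variant (do not use): capturing [&first_path] would leave
        // the closure holding a dangling reference once this block ends.
        t.on_done = [first_path]() { return first_path; };
    } // first_path is destroyed here; the copy inside the closure survives
    return t;
}
```

Calling `make_task_safe().on_done()` after the inner block has exited is well defined because the closure holds its own copy of the string.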

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

b9829

github-actions released this 28 Jun 06:46

b9829

27c8bb4

logs : reduce v2 (#25078)

server : reduce logs
cont : common
cont : spec
cont : CMN_ -> COM_

b9828

github-actions released this 27 Jun 23:15

b9828

ebd048f

opencl: flash attention improvement (#25069)

opencl: rework FA kernel for f16 and f32
opencl: flash-attention prefill prepass kernels

flash_attn_kv_pad_f16: pads the tail KV tile to a BLOCK_N multiple
flash_attn_mask_pad_f16: pads the matching mask tile
flash_attn_blk_f16: classifies each KV tile per query block as fully masked / mixed / fully unmasked, so the main kernel can skip fully-masked tiles and the mask lookup for fully-unmasked ones
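
The tile classification described for flash_attn_blk_f16 can be sketched on the host as follows. This is a plain C++ illustration, not the OpenCL kernel: the function names, the row-major mask layout, and the convention that 0.0f means visible and negative infinity means masked are all assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

enum class tile_kind { fully_masked, mixed, fully_unmasked };

// Assumed layout: mask is row-major [n_q x n_kv]; 0.0f = visible,
// -INFINITY = masked. Classifies the tile covering query rows [q0, q1)
// and KV columns [k0, k1). An empty tile is reported as fully masked.
inline tile_kind classify_tile(const std::vector<float>& mask, std::size_t n_kv,
                               std::size_t q0, std::size_t q1,
                               std::size_t k0, std::size_t k1) {
    bool any_masked = false;
    bool any_visible = false;
    for (std::size_t q = q0; q < q1; ++q) {
        for (std::size_t k = k0; k < k1; ++k) {
            const float m = mask[q * n_kv + k];
            if (std::isinf(m) && m < 0.0f) {
                any_masked = true;
            } else {
                any_visible = true;
            }
        }
    }
    if (!any_visible) return tile_kind::fully_masked;   // tile can be skipped
    if (!any_masked)  return tile_kind::fully_unmasked; // mask lookup can be skipped
    return tile_kind::mixed;                            // needs the per-element mask
}

// Builds an n x n causal mask: position k is masked for query q when k > q.
inline std::vector<float> causal_mask(std::size_t n) {
    std::vector<float> m(n * n, 0.0f);
    for (std::size_t q = 0; q < n; ++q) {
        for (std::size_t k = q + 1; k < n; ++k) {
            m[q * n + k] = -INFINITY;
        }
    }
    return m;
}
```

With a 4x4 causal mask and 2x2 tiles, the upper-right tile is fully masked, the lower-left tile is fully unmasked, and the diagonal tiles are mixed, which is the case where a main kernel would still need the per-element mask.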

opencl: FA kernels for q4_0 and q8_0
opencl: set_rows for f32 to q8_0/q4_0
opencl: dequant kernels for q4_0 and q8_0
opencl: add FA tile tuning table with override
opencl: wire host side for FA
opencl: q4_0 MoE tensors are also SOA'ed
opencl: cosmetic fix
opencl: refactor, also clarify some code paths in comments
opencl: fix infinity for -cl-finite-math-only

Co-authored-by: Li He <lih@qti.qualcomm.com>

b9827

github-actions released this 27 Jun 12:49

b9827

0ed235e

[CUDA] Added a cudaMemcpy2DAsync fast path to ggml_cuda_cpy (#25057)

[CUDA] Added a cudaMemcpy2DAsync fast path to ggml_cuda_cpy

Add a CUDA ggml_cpy fast path for same-type, same-shape strided copies that are just 2D pitched block copies.
When tensors are not fully contiguous but each row is contiguous, it now uses cudaMemcpy2DAsync instead of the slow element-wise scalar copy kernel.

This fixes the GDN recurrent snapshot update with -np 4, where rollback slots are separated by cache stride gaps.
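
The idea of a 2D pitched block copy can be sketched on the host. The code below is a CPU analogue using `std::memcpy`, not the ggml CUDA code; its parameter order (dst, dpitch, src, spitch, width, height) deliberately mirrors that of `cudaMemcpy2DAsync`, and the function names and sizes are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Host-side analogue of a 2D pitched copy: `height` rows of `width` bytes,
// where consecutive rows start `spitch` (source) and `dpitch` (destination)
// bytes apart. Each row is contiguous, but the rows need not be adjacent.
inline void copy_2d_pitched(void* dst, std::size_t dpitch,
                            const void* src, std::size_t spitch,
                            std::size_t width, std::size_t height) {
    assert(width <= dpitch && width <= spitch);
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    for (std::size_t r = 0; r < height; ++r) {
        std::memcpy(d + r * dpitch, s + r * spitch, width); // one block per row
    }
}

// Example: gather 3 rows of 4 floats from a source whose rows are 6 floats
// apart (a stride gap of 2 floats per row) into a densely packed destination.
inline std::vector<float> demo_strided_copy() {
    const std::size_t rows = 3, cols = 4, src_stride = 6;
    std::vector<float> src(rows * src_stride);
    for (std::size_t i = 0; i < src.size(); ++i) {
        src[i] = static_cast<float>(i);
    }
    std::vector<float> dst(rows * cols);
    copy_2d_pitched(dst.data(), cols * sizeof(float),
                    src.data(), src_stride * sizeof(float),
                    cols * sizeof(float), rows);
    return dst;
}
```

The point of the pattern is that one call moves whole rows as blocks instead of visiting every element individually, which is what makes the row-contiguous case cheaper than a general strided copy.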

Add new tests that execute the new optimized strided copy path
Return unsupported for strided copy in OpenVINO, as new tests are failing

b9826

github-actions released this 27 Jun 11:00

b9826

9bebfcb

sycl : fix failed ut cases of norm (#25044)

b9825

github-actions released this 27 Jun 10:31

b9825

0b6529d

vulkan: fix step operator for 0 input (#25036)

b9824

github-actions released this 27 Jun 10:03

b9824

c299a92

binaries : Improve rpc-server and export-graph-ops names. (#25045)

Tests are generally prefixed with -test, so rename export-graph-ops
accordingly.

rpc-server is probably too generic a name for /usr/bin. Because it
should work with any ggml application, it is renamed to ggml-rpc-server.

b9823

github-actions released this 27 Jun 09:13

b9823

0275c0f

ci : add windows-openvino to check-release (#25022)
