Releases: ggml-org/llama.cpp

b9832

@github-actions github-actions released this 28 Jun 14:18
f68a788

jinja: add --dump-prog for debugging (#25086)

  • jinja: add --dump-prog for debugging

  • Update common/jinja/runtime.cpp

Co-authored-by: Sigbjørn Skjæret 1629204+CISC@users.noreply.github.com


macOS/iOS:

Linux:

Android:

Windows:

openEuler:

  • DISABLED
  • openEuler x86 (310p)
  • openEuler x86 (910b, ACL Graph)
  • openEuler aarch64 (310p)
  • openEuler aarch64 (910b, ACL Graph)

UI:

b9831

@github-actions github-actions released this 28 Jun 13:38
d1b3425

spec : add DFlash support (#22105)

  • spec: add DFlash v2 support

  • dflash: support sliding window attention per layer_types

  • docs: add dflash section


Co-authored-by: Kashif Rasul kashif.rasul@gmail.com

openEuler:

  • DISABLED
  • openEuler x86 (310p)
  • openEuler x86 (910b, ACL Graph)
  • openEuler aarch64 (310p)
  • openEuler aarch64 (910b, ACL Graph)

b9830

@github-actions github-actions released this 28 Jun 11:03
c1a1c8e

common : allow --offline in llama download (#25091)

Expose the existing --offline flag to llama download so a script can
run it to check whether a model is already cached and ready to be served
without touching the network.

Also fix a latent use-after-free in the URL-task on_done callback:
first_path is block-scoped and was captured by reference, but invoked
after the block ends.
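
The use-after-free described above follows a common pattern: a callback that runs later captures a block-scoped variable by reference. The following is a minimal sketch of the safe form, not the actual llama.cpp code. `task_queue`, `schedule`, and the `model.gguf` file name are hypothetical, and `first_path` is reused only to mirror the description.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a task system that invokes callbacks later.
struct task_queue {
    std::vector<std::function<std::string()>> on_done;

    // Runs every stored callback and collects the results.
    void run_all(std::vector<std::string> & out) {
        for (auto & cb : on_done) {
            out.push_back(cb());
        }
    }
};

// Registers a callback that outlives the block where first_path is declared.
inline void schedule(task_queue & q, const std::string & dir) {
    assert(!dir.empty());
    {
        std::string first_path = dir + "/model.gguf"; // block-scoped

        // Capturing with [&first_path] would leave a dangling reference,
        // because the callback is invoked after this block has ended.
        // Capturing by value gives the lambda its own copy instead.
        q.on_done.push_back([first_path]() { return first_path; });
    } // first_path is destroyed here; the lambda's copy stays valid
}
```

Capturing by value costs one string copy per callback, which is usually negligible next to a download task.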

Signed-off-by: Adrien Gallouët angt@huggingface.co

openEuler:

  • DISABLED
  • openEuler x86 (310p)
  • openEuler x86 (910b, ACL Graph)
  • openEuler aarch64 (310p)
  • openEuler aarch64 (910b, ACL Graph)

b9829

@github-actions github-actions released this 28 Jun 06:46
27c8bb4

logs : reduce v2 (#25078)

  • server : reduce logs

  • cont : common

  • cont : spec

  • cont : CMN_ -> COM_

openEuler:

  • DISABLED
  • openEuler x86 (310p)
  • openEuler x86 (910b, ACL Graph)
  • openEuler aarch64 (310p)
  • openEuler aarch64 (910b, ACL Graph)

b9828

@github-actions github-actions released this 27 Jun 23:15
ebd048f

opencl: flash attention improvement (#25069)

  • opencl: rework FA kernel for f16 and f32

  • opencl: flash-attention prefill prepass kernels

      • flash_attn_kv_pad_f16 pads the tail KV tile to a BLOCK_N multiple
      • flash_attn_mask_pad_f16 pads the matching mask tile
      • flash_attn_blk_f16 classifies each KV tile per query block as fully masked / mixed / fully unmasked, so the main kernel can skip fully-masked tiles and the mask lookup for fully-unmasked ones

  • opencl: FA kernels for q4_0 and q8_0

  • opencl: set_rows for f32 to q8_0/q4_0

  • opencl: dequant kernels for q4_0 and q8_0

  • opencl: add FA tile tuning table with override

  • opencl: wire host side for FA

  • opencl: q4_0 MoE tensors are also SOA'ed

  • opencl: cosmetic fix

  • opencl: refactor, also clarify some code paths in comments

  • opencl: fix infinity for -cl-finite-math-only
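
The prefill prepass idea from the list above (pad the tail tile, then classify each KV tile per query block) can be sketched on the host in plain C++. This is an illustration only, not the OpenCL kernels. `pad_to_multiple`, `classify_tiles`, the `block_m` query-block size, and the convention that a mask entry of negative infinity means "masked" are all assumptions. As a simplification, the sketch counts only in-range elements rather than materializing padded tiles.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

enum class tile_kind { fully_masked, mixed, fully_unmasked };

// Rounds n up to the next multiple of block.
inline std::size_t pad_to_multiple(std::size_t n, std::size_t block) {
    assert(block > 0);
    return (n + block - 1) / block * block;
}

// mask is row-major [n_q][n_kv]; 0.0f means visible, -infinity means masked.
// Returns one tile_kind per (query block, KV tile), row-major over
// [ceil(n_q / block_m)][ceil(n_kv / block_n)].
inline std::vector<tile_kind> classify_tiles(const std::vector<float> & mask,
                                             std::size_t n_q, std::size_t n_kv,
                                             std::size_t block_m, std::size_t block_n) {
    assert(mask.size() == n_q * n_kv);
    const std::size_t q_blocks = pad_to_multiple(n_q, block_m) / block_m;
    const std::size_t kv_tiles = pad_to_multiple(n_kv, block_n) / block_n;

    std::vector<tile_kind> kinds;
    kinds.reserve(q_blocks * kv_tiles);
    for (std::size_t qb = 0; qb < q_blocks; ++qb) {
        for (std::size_t kt = 0; kt < kv_tiles; ++kt) {
            std::size_t masked = 0;
            std::size_t total  = 0;
            const std::size_t q_end = std::min(n_q,  (qb + 1) * block_m);
            const std::size_t k_end = std::min(n_kv, (kt + 1) * block_n);
            for (std::size_t q = qb * block_m; q < q_end; ++q) {
                for (std::size_t k = kt * block_n; k < k_end; ++k) {
                    const float v = mask[q * n_kv + k];
                    ++total;
                    if (std::isinf(v) && v < 0.0f) {
                        ++masked;
                    }
                }
            }
            kinds.push_back(masked == total ? tile_kind::fully_masked
                          : masked == 0     ? tile_kind::fully_unmasked
                                            : tile_kind::mixed);
        }
    }
    return kinds;
}
```

With such a table, a main loop could skip `fully_masked` tiles entirely and skip the per-element mask read for `fully_unmasked` ones, leaving the mask lookup only for `mixed` tiles.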


Co-authored-by: Li He lih@qti.qualcomm.com

openEuler:

  • DISABLED
  • openEuler x86 (310p)
  • openEuler x86 (910b, ACL Graph)
  • openEuler aarch64 (310p)
  • openEuler aarch64 (910b, ACL Graph)

b9827

@github-actions github-actions released this 27 Jun 12:49
0ed235e

[CUDA] Added a cudaMemcpy2DAsync fast path to ggml_cuda_cpy (#25057)

  • [CUDA] Added a cudaMemcpy2DAsync fast path to ggml_cuda_cpy

Add a CUDA ggml_cpy fast path for same-type, same-shape strided copies that are just 2D pitched block copies.
When tensors are not fully contiguous but each row is contiguous, it now uses cudaMemcpy2DAsync instead of the slow element-wise scalar copy kernel.

This fixes the GDN recurrent snapshot update with -np 4, where rollback slots are separated by cache stride gaps.

  • Add new tests that execute the new optimized strided copy path

  • Return unsupported for strided copy in OpenVINO, as new tests are failing
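
To illustrate what "a 2D pitched block copy" means in the description above, here is a CPU-only sketch in plain C++. It is not the CUDA implementation. `copy_2d_pitched` is a hypothetical helper whose parameters mirror the first six arguments of `cudaMemcpy2DAsync` (destination, destination pitch, source, source pitch, row width in bytes, row count), with `std::memcpy` standing in for the device copy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Copies `height` rows of `width` bytes each. Every row is contiguous,
// but consecutive rows may be separated by a gap (pitch >= width),
// which is the layout described for rows separated by stride gaps.
inline void copy_2d_pitched(void * dst, std::size_t dpitch,
                            const void * src, std::size_t spitch,
                            std::size_t width, std::size_t height) {
    assert(dpitch >= width && spitch >= width);
    auto *       d = static_cast<unsigned char *>(dst);
    const auto * s = static_cast<const unsigned char *>(src);
    for (std::size_t row = 0; row < height; ++row) {
        // One bulk copy per row instead of one operation per element.
        std::memcpy(d + row * dpitch, s + row * spitch, width);
    }
}
```

The benefit comes from replacing per-element work with a per-row bulk transfer. On the GPU, the corresponding call hands the whole pitched region to the driver in a single asynchronous request.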

openEuler:

  • DISABLED
  • openEuler x86 (310p)
  • openEuler x86 (910b, ACL Graph)
  • openEuler aarch64 (310p)
  • openEuler aarch64 (910b, ACL Graph)

b9826

@github-actions github-actions released this 27 Jun 11:00
9bebfcb

b9825

@github-actions github-actions released this 27 Jun 10:31
0b6529d

b9824

@github-actions github-actions released this 27 Jun 10:03
c299a92

binaries : Improve rpc-server and export-graph-ops names. (#25045)

Tests are generally prefixed with test-, so rename export-graph-ops
accordingly.

rpc-server is probably too generic a name for /usr/bin. Because it
should work with any ggml application, it is renamed to ggml-rpc-server.

openEuler:

  • DISABLED
  • openEuler x86 (310p)
  • openEuler x86 (910b, ACL Graph)
  • openEuler aarch64 (310p)
  • openEuler aarch64 (910b, ACL Graph)

b9823

@github-actions github-actions released this 27 Jun 09:13
0275c0f