feat: single-build ggml CPU_ALL_VARIANTS for llama-cpp + turboquant (x86/arm64/apple) #10497

Open
localai-bot wants to merge 4 commits into master from feat/llama-cpp-cpu-all-variants

Conversation

@localai-bot localai-bot commented Jun 24, 2026

Collaborator

What

Replaces the per-microarch multi-binary C++ backend builds with a single grpc-server plus the set of dlopen-able libggml-cpu-* libraries produced by ggml's GGML_CPU_ALL_VARIANTS. ggml's backend registry probes the host CPU at runtime and loads the best variant, so the shell-side /proc/cpuinfo probing in run.sh is gone.

Applies to llama-cpp (x86 + arm64 + darwin/apple) and turboquant (x86 + arm64). ik-llama-cpp is intentionally excluded (see below).

Why

  • One build instead of four on x86; broader coverage than the hand-rolled set (x86 gains alderlake AVX-VNNI, zen4 AVX512-BF16, sapphirerapids AMX; arm64 gets armv8.0→armv9.2; darwin gets apple_m1/m2_m3/m4).
  • Smaller images, less bespoke runtime machinery.

How it links (the non-obvious parts)

CPU_ALL_VARIANTS requires GGML_BACKEND_DL + BUILD_SHARED_LIBS=ON, so ggml/llama become shared objects. Three things make that coexist with the static-gRPC grpc-server:

  1. SHARED_LIBS is a make variable (default OFF). An appended -DBUILD_SHARED_LIBS=ON gets re-clobbered by the recursive sub-make into the VARIANT build dir; a command-line make variable propagates and wins.
  2. --target ggml is added — the per-microarch backends are runtime-dlopened, not link deps, so they only build via ggml's add_dependencies().
  3. hw_grpc_proto is pinned STATIC — under BUILD_SHARED_LIBS=ON it would become a DSO referencing the hidden-visibility symbols in static libprotobuf.a (ld: hidden symbol ... referenced by DSO). Keeping it static links gRPC/protobuf into the executable while only ggml/llama go shared. No PIC change, no base-grpc-* rebuild required.
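
The flag set described above can be sketched as a pair of small shell helpers. This is a minimal sketch, not the PR's Makefile: the function names are hypothetical, the grpc-server target name is assumed from the binary name, and -DGGML_NATIVE=OFF is an assumption based on how portable llama.cpp builds are commonly configured. The PR text itself only names GGML_CPU_ALL_VARIANTS, GGML_BACKEND_DL, BUILD_SHARED_LIBS=ON and --target ggml.

```shell
# Hypothetical helper: print the configure flags named in the description,
# one per line, so a caller can splice them into a cmake invocation.
cpu_all_cmake_args() {
    printf '%s\n' \
        "-DGGML_CPU_ALL_VARIANTS=ON" \
        "-DGGML_BACKEND_DL=ON" \
        "-DBUILD_SHARED_LIBS=ON" \
        "-DGGML_NATIVE=OFF"   # assumption, not stated in the PR text
}

# Hypothetical helper: print the build targets. "ggml" is listed explicitly
# because the dlopen-ed CPU variants are not link dependencies of the server.
cpu_all_build_targets() {
    printf '%s\n' grpc-server ggml
}

# Example (not executed here):
#   cmake -S . -B build $(cpu_all_cmake_args)
#   cmake --build build --target $(cpu_all_build_targets)
```

Keeping the flags in one place mirrors the point of item 1: whatever mechanism carries BUILD_SHARED_LIBS=ON has to reach the configure step of the variant build dir intact.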

turboquant copies llama-cpp's CMakeLists.txt + Makefile per flavor, so it inherits all three for free.

Per-platform specifics

  • arm64: ggml's armv9.2 SME variants use -march=...+sme, rejected by Ubuntu 24.04's default gcc-13. The arm64 compile step installs and uses gcc-14. (For CI's prebuilt base-grpc-arm64, gcc-14 should move into the base — follow-up.)
  • darwin: no bundled ld.so, so ggml scans the binary's own directory. ggml emits its loadable backends (CPU variants, metal, blas) with a .so suffix even on macOS, while core libs are .dylib. The .so files go in the package root (for the scan) and the .dylib files in lib/ (for DYLD_LIBRARY_PATH). Metal is preserved (GGML_METAL stays ON; --target ggml builds ggml-metal).
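
The darwin split by suffix can be sketched as follows. This is an illustration under stated assumptions, not the contents of scripts/build/llama-cpp-darwin.sh: the function name, the argument layout and the libggml-*.so name pattern are hypothetical.

```shell
# Hypothetical helper: split built libraries by suffix.
#   $1 = directory containing the built libraries (searched recursively)
#   $2 = package root (the directory that holds the grpc-server binary)
split_darwin_libs() {
    build_dir=$1
    pkg_root=$2
    mkdir -p "$pkg_root/lib"
    # Loadable backends (.so): next to the binary, for ggml's directory scan.
    find "$build_dir" -type f -name 'libggml-*.so' -exec cp {} "$pkg_root/" \;
    # Core libraries (.dylib): into lib/, resolved through DYLD_LIBRARY_PATH.
    find "$build_dir" -type f -name '*.dylib' -exec cp {} "$pkg_root/lib/" \;
}
```

A call such as `split_darwin_libs build package` would leave the CPU variants and Metal backend beside the binary and the core libraries under package/lib; both directory names here are placeholders.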

ik-llama-cpp: excluded

Its pinned ggml has zero CPU_ALL_VARIANTS support and its IQK kernels require AVX2, so the per-microarch dynamic backend set does not apply. Left unchanged (avx2/fallback).

Validation (all live inference, not just builds)

| Target | Build | Smoke (LoadModel + Predict) |
|---|---|---|
| llama-cpp x86 | ✅ | selected AVX512-BF16 (zen) variant, coherent tokens |
| llama-cpp arm64 (dgx/Grace) | ✅ 8 arm variants shipped | ✅ selected SVE+MATMUL_INT8 variant, coherent tokens |
| llama-cpp darwin (M4) | ✅ apple_m1/m2_m3/m4 + metal | ✅ selected apple_m4 (SME=1) + Metal, coherent tokens |
| turboquant x86 | ✅ 14 variants, 0 link errors | (build/package verified) |
| turboquant arm64 | 🔄 in progress | - |

Each shipped image was inspected to confirm the variant set is present and cpu-all is NEEDED-linked against shared ggml with gRPC static; the old avx/avx2/avx512 binaries are gone.
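
An inspection of this kind could be scripted roughly as below. This is a sketch, not the check that was actually run: both function names are hypothetical, and the first helper assumes the usual `readelf -d` line layout, where each dependency appears as `(NEEDED) Shared library: [name]`.

```shell
# Hypothetical helper: read `readelf -d` output on stdin and print the
# names of the NEEDED shared libraries, one per line.
needed_libs() {
    sed -n 's/.*(NEEDED).*\[\(.*\)].*/\1/p'
}

# Hypothetical helper: list the CPU variant names shipped in a directory,
# e.g. "libggml-cpu-zen4.so" is printed as "zen4".
list_cpu_variants() {
    for f in "$1"/libggml-cpu-*.so; do
        [ -e "$f" ] || continue          # skip the unexpanded glob
        name=${f##*/libggml-cpu-}        # drop directory and name prefix
        printf '%s\n' "${name%.so}"      # drop the .so suffix
    done
}

# Example (not executed here; paths are placeholders):
#   readelf -d package/llama-cpp-cpu-all | needed_libs
#   list_cpu_variants package/lib
```

With gRPC linked statically, the NEEDED list would be expected to name the shared ggml/llama libraries and no gRPC or protobuf libraries.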

🤖 Generated with Claude Code

mudler added 4 commits June 24, 2026 21:21
Replace the per-microarch avx/avx2/avx512/fallback multi-binary build on
x86 with a single grpc-server plus the dlopen-able libggml-cpu-*.so set
that ggml's backend registry selects at runtime by probing host CPU
features. One build instead of four, broader microarch coverage (adds
alderlake AVX-VNNI, zen4 AVX512-BF16, sapphirerapids AMX), and the
shell-side /proc/cpuinfo probing in run.sh goes away.

Build/link notes:
- CPU_ALL_VARIANTS requires GGML_BACKEND_DL + BUILD_SHARED_LIBS=ON, so
  ggml/llama become shared objects. SHARED_LIBS is now a make variable
  (default OFF) so the override survives the recursive sub-make into the
  VARIANT build dir instead of being re-clobbered by the base flags.
- The cpu-all target also builds "--target ggml": the per-microarch
  backends are runtime-dlopened, not link deps, so they only compile via
  ggml's add_dependencies().
- hw_grpc_proto is pinned STATIC. Under BUILD_SHARED_LIBS=ON it would
  otherwise become a DSO referencing hidden-visibility symbols in the
  static libprotobuf.a, which fails to link ("hidden symbol ... is
  referenced by DSO"). Keeping it static links gRPC/protobuf into the
  executable while only ggml/llama stay shared, so no PIC or base-image
  change is required.
- package.sh bundles the libggml-*.so set into package/lib; ggml finds
  them by scanning the bundled ld.so directory (/proc/self/exe), which
  run.sh launches from.

Scope: x86 only. arm64/darwin keep the single fallback build. The
ik-llama-cpp / turboquant forks and the other ggml C++ backends are
unchanged; the same recipe applies but is out of scope here.

Validated with a full docker build plus a live inference smoke test:
the model loads, ggml selects the AVX512_BF16 variant on a Zen-class
host, and tokens generate correctly.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
…uant

- llama-cpp: x86 AND arm64 now use the single llama-cpp-cpu-all build
  (only hipblas keeps the fallback build). ggml's arm64 variant table
  (armv8.x / armv9.x, plus apple_m* on darwin) is selected at runtime.
- turboquant: same recipe via a turboquant-cpu-all target. turboquant
  copies backend/cpp/llama-cpp's CMakeLists.txt + Makefile per flavor, so
  the hw_grpc_proto STATIC fix and the SHARED_LIBS / EXTRA_CMAKE_ARGS
  make-vars are inherited; the target just passes SHARED_LIBS=ON, the DL
  flags and --target ggml through, then collects the .so set. run.sh and
  package.sh updated to ship/select turboquant-cpu-all.
- Makefile lib-collection find now also matches *.dylib (for the darwin
  build, which emits dylibs rather than .so).

ik-llama-cpp is intentionally left unchanged: its pinned ggml has no
CPU_ALL_VARIANTS support and its IQK kernels require AVX2, so the
per-microarch dynamic backend set does not apply.

Scope still excludes the darwin packaging wiring (separate change).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
…u-all packaging

- arm64: ggml CPU_ALL_VARIANTS builds armv9.2 SME variants whose -march=...+sme
  is rejected by the Ubuntu 24.04 default gcc-13. Build the arm64 variants with
  gcc-14 (installed in the compile step). The host only selects a variant it
  actually supports at runtime, but every variant must still compile.
- darwin: scripts/build/llama-cpp-darwin.sh builds llama-cpp-cpu-all instead of
  the fallback binary, keeps Metal (GGML_METAL stays ON; --target ggml also builds
  ggml-metal). The per-microarch libggml-cpu-*.dylib are placed in the package
  root next to the binary (darwin has no bundled ld.so, so ggml's executable-dir
  scan looks there), while the other shared dylibs go in lib/ for DYLD_LIBRARY_PATH.
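
The arm64 compiler choice can be sketched as a small selector. This is a minimal sketch under assumptions: the function name is hypothetical, the fallback to the default gcc/g++ on other architectures is assumed and not stated in the commit, and the package and make target names in the usage comment are illustrative.

```shell
# Hypothetical helper: print "CC CXX" for a given machine architecture.
# arm64 gets gcc-14 because the armv9.2 SME variants need a compiler that
# accepts -march=...+sme; other architectures keep the default toolchain
# (an assumption made for this sketch).
select_compilers() {
    case "$1" in
        aarch64|arm64) printf '%s\n' "gcc-14 g++-14" ;;
        *)             printf '%s\n' "gcc g++" ;;
    esac
}

# Example (not executed here; package names assume Ubuntu 24.04):
#   apt-get install -y gcc-14 g++-14
#   set -- $(select_compilers "$(uname -m)")
#   CC=$1 CXX=$2 make llama-cpp-cpu-all
```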

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
….dylib lib)

ggml emits its loadable backends (per-microarch CPU variants, metal, blas) with a
.so suffix even on darwin, while the core libraries (ggml-base/ggml/llama/
llama-common/mtmd) use .dylib. Split the distribution by suffix: .so DL backends
go in the package root for ggml's executable-directory scan, .dylib core libs go
in lib/ for DYLD_LIBRARY_PATH. The previous .dylib name-pattern matched none of the
variants.

Verified on an M4: ggml loads the apple_m4 CPU variant (SME=1) and Metal, model
loads and generates correct tokens.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
localai-bot changed the title from "feat(llama-cpp): single x86 CPU build via ggml CPU_ALL_VARIANTS" to "feat: single-build ggml CPU_ALL_VARIANTS for llama-cpp + turboquant (x86/arm64/apple)" on Jun 24, 2026