feat: single-build ggml CPU_ALL_VARIANTS for llama-cpp + turboquant (x86/arm64/apple) by localai-bot · Pull Request #10497 · mudler/LocalAI

localai-bot · 2026-06-24T21:21:37Z

What

Replaces the per-microarch multi-binary C++ backend builds with a single grpc-server plus the set of dlopen-able libggml-cpu-* libraries produced by ggml's GGML_CPU_ALL_VARIANTS. ggml's backend registry probes the host CPU at runtime and loads the best variant, so the shell-side /proc/cpuinfo probing in run.sh is gone.
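The runtime selection described above can be pictured as a scoring pass: each variant declares the CPU features it was compiled for, variants the host cannot run are skipped, and the highest-scoring remaining one is loaded. The following is a minimal sketch of that idea only, not ggml's actual code. The variant names, feature sets, and scoring rule are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str            # hypothetical variant name, for illustration only
    requires: frozenset  # CPU features the variant was compiled for


def score(variant, host_features):
    """Return 0 if the host lacks a required feature, else a positive score."""
    if not variant.requires <= host_features:
        return 0
    # Assumed rule: a variant that uses more features ranks higher.
    return 1 + len(variant.requires)


def pick_variant(variants, host_features):
    """Return the highest-scoring runnable variant, or None if none can run."""
    best = max(variants, key=lambda v: score(v, host_features), default=None)
    if best is None or score(best, host_features) == 0:
        return None
    return best


# Illustrative table; the real set is whatever GGML_CPU_ALL_VARIANTS builds.
VARIANTS = [
    Variant("baseline", frozenset()),
    Variant("avx2", frozenset({"avx", "avx2", "fma"})),
    Variant("avx512_bf16",
            frozenset({"avx", "avx2", "fma", "avx512f", "avx512_bf16"})),
]

if __name__ == "__main__":
    host = frozenset({"avx", "avx2", "fma"})
    print(pick_variant(VARIANTS, host).name)
```

Because the baseline entry requires nothing, a host with none of the optional features still gets a usable variant, which is the role the old fallback binary played.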

Applies to llama-cpp (x86 + arm64 + darwin/apple) and turboquant (x86 + arm64). ik-llama-cpp is intentionally excluded (see below).

Why

One build instead of four on x86; broader coverage than the hand-rolled set (x86 gains alderlake AVX-VNNI, zen4 AVX512-BF16, sapphirerapids AMX; arm64 gets armv8.0→armv9.2; darwin gets apple_m1/m2_m3/m4).
Smaller images, less bespoke runtime machinery.

How it links (the non-obvious parts)

CPU_ALL_VARIANTS → GGML_BACKEND_DL → BUILD_SHARED_LIBS=ON, so ggml/llama become shared objects. Three things make that coexist with the static-gRPC grpc-server:

SHARED_LIBS is now a make variable (default OFF). Simply appending -DBUILD_SHARED_LIBS=ON does not work: the recursive sub-make into the VARIANT build dir re-applies the base flags and clobbers it. A make variable set on the command line propagates into the sub-make and wins.
--target ggml is added — the per-microarch backends are runtime-dlopened, not link deps, so they only build via ggml's add_dependencies().
hw_grpc_proto is pinned STATIC — under BUILD_SHARED_LIBS=ON it would become a DSO referencing the hidden-visibility symbols in static libprotobuf.a (ld: hidden symbol ... referenced by DSO). Keeping it static links gRPC/protobuf into the executable while only ggml/llama go shared. No PIC change, no base-grpc-* rebuild required.

turboquant copies llama-cpp's CMakeLists.txt + Makefile per flavor, so it inherits all three for free.

Per-platform specifics

arm64: ggml's armv9.2 SME variants use -march=...+sme, which Ubuntu 24.04's default gcc-13 rejects. A host only loads a variant it supports, but every variant must still compile, so the arm64 compile step installs and uses gcc-14. (For CI's prebuilt base-grpc-arm64, gcc-14 should move into the base image; that is a follow-up.)
darwin: there is no bundled ld.so, so ggml scans the binary's own directory. ggml emits its loadable backends (CPU variants, metal, blas) with a .so suffix even on macOS, while the core libs are .dylib. The .so files therefore go in the package root (for the scan) and the .dylib files go in lib/ (for DYLD_LIBRARY_PATH). Metal is preserved (GGML_METAL stays ON; --target ggml builds ggml-metal).
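The darwin layout rule can be sketched as a small routine that routes files by suffix. This is an illustrative Python sketch, not the project's package.sh or scripts/build/llama-cpp-darwin.sh. The function name and directory arguments are assumptions; only the rule itself (.so to the package root, .dylib to lib/) comes from the description above.

```python
import shutil
from pathlib import Path


def split_by_suffix(build_dir, package_dir):
    """Copy .so backends to the package root and .dylib core libs to lib/.

    Returns a dict mapping each copied file name to its destination path.
    Assumes package_dir is not located inside build_dir.
    """
    build_dir, package_dir = Path(build_dir), Path(package_dir)
    lib_dir = package_dir / "lib"
    lib_dir.mkdir(parents=True, exist_ok=True)

    placed = {}
    for path in sorted(build_dir.rglob("*")):
        if not path.is_file():
            continue
        if path.suffix == ".so":
            # Loadable backends: found by the executable-directory scan.
            dest = package_dir / path.name
        elif path.suffix == ".dylib":
            # Core libraries: resolved through DYLD_LIBRARY_PATH.
            dest = lib_dir / path.name
        else:
            continue
        shutil.copy2(path, dest)
        placed[path.name] = dest
    return placed
```

Matching on the suffix rather than on a name pattern is the point: a pattern that expects the CPU variants to end in .dylib matches none of them.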

ik-llama-cpp: excluded

Its pinned ggml has no CPU_ALL_VARIANTS support and its IQK kernels require AVX2, so the per-microarch dynamic backend set does not apply. It is left unchanged (avx2/fallback).

Validation (all live inference, not just builds)

Target	Build	Smoke (LoadModel + Predict)
llama-cpp x86	✅	✅ selected AVX512-BF16 (zen) variant, coherent tokens
llama-cpp arm64 (dgx/Grace)	✅ 8 arm variants shipped	✅ selected SVE+MATMUL_INT8 variant, coherent tokens
llama-cpp darwin (M4)	✅ apple_m1/m2_m3/m4 + metal	✅ selected apple_m4 (SME=1) + Metal, coherent tokens
turboquant x86	✅ 14 variants, 0 link errors	(build/package verified)
turboquant arm64	🔄 in progress	-

Each shipped image was inspected to confirm that the variant set is present, that the cpu-all binary lists the shared ggml libraries as NEEDED entries while gRPC stays statically linked, and that the old avx/avx2/avx512 binaries are gone.
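A linkage inspection of this kind can be scripted on Linux by parsing the dynamic section printed by `readelf -d`. The sketch below is an assumption about how such a check might look, not the procedure used for this PR. The sample text is made up for illustration and is not output captured from the shipped images. On darwin, `otool -L` would be the analogous tool.

```python
import re

# Matches lines such as:
#  0x0000000000000001 (NEEDED)   Shared library: [libggml.so]
NEEDED_RE = re.compile(r"\(NEEDED\)\s+Shared library: \[([^\]]+)\]")


def needed_libs(readelf_output):
    """Return the library names listed as NEEDED in `readelf -d` output."""
    return NEEDED_RE.findall(readelf_output)


def check_linkage(readelf_output):
    """True if ggml is a shared dependency and gRPC/protobuf are not."""
    libs = needed_libs(readelf_output)
    shared_ggml = any(name.startswith("libggml") for name in libs)
    static_grpc = not any("grpc" in name or "protobuf" in name for name in libs)
    return shared_ggml and static_grpc


# Hypothetical sample, written by hand to show the expected shape.
SAMPLE = """\
 0x0000000000000001 (NEEDED)             Shared library: [libllama.so]
 0x0000000000000001 (NEEDED)             Shared library: [libggml.so]
 0x0000000000000001 (NEEDED)             Shared library: [libggml-base.so]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
"""

if __name__ == "__main__":
    print(needed_libs(SAMPLE))
    print(check_linkage(SAMPLE))
```

The per-microarch libggml-cpu-* libraries are not expected in this list: they are loaded with dlopen at runtime, so their presence has to be checked on disk rather than in the NEEDED entries.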

🤖 Generated with Claude Code

Replace the per-microarch avx/avx2/avx512/fallback multi-binary build on x86 with a single grpc-server plus the dlopen-able libggml-cpu-*.so set that ggml's backend registry selects at runtime by probing host CPU features. One build instead of four, broader microarch coverage (adds alderlake AVX-VNNI, zen4 AVX512-BF16, sapphirerapids AMX), and the shell-side /proc/cpuinfo probing in run.sh goes away.

Build/link notes:
- CPU_ALL_VARIANTS requires GGML_BACKEND_DL + BUILD_SHARED_LIBS=ON, so ggml/llama become shared objects. SHARED_LIBS is now a make variable (default OFF) so the override survives the recursive sub-make into the VARIANT build dir instead of being re-clobbered by the base flags.
- The cpu-all target also builds "--target ggml": the per-microarch backends are runtime-dlopened, not link deps, so they only compile via ggml's add_dependencies().
- hw_grpc_proto is pinned STATIC. Under BUILD_SHARED_LIBS=ON it would otherwise become a DSO referencing hidden-visibility symbols in the static libprotobuf.a, which fails to link ("hidden symbol ... is referenced by DSO"). Keeping it static links gRPC/protobuf into the executable while only ggml/llama stay shared, so no PIC or base-image change is required.
- package.sh bundles the libggml-*.so set into package/lib; ggml finds them by scanning the bundled ld.so directory (/proc/self/exe), which run.sh launches from.

Scope: x86 only. arm64/darwin keep the single fallback build. The ik-llama-cpp / turboquant forks and the other ggml C++ backends are unchanged; the same recipe applies but is out of scope here.

Validated with a full docker build plus a live inference smoke test: the model loads, ggml selects the AVX512_BF16 variant on a Zen-class host, and tokens generate correctly.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

…uant

- llama-cpp: x86 AND arm64 now use the single llama-cpp-cpu-all build (only hipblas keeps the fallback build). ggml's arm64 variant table (armv8.x / armv9.x, plus apple_m* on darwin) is selected at runtime.
- turboquant: same recipe via a turboquant-cpu-all target. turboquant copies backend/cpp/llama-cpp's CMakeLists.txt + Makefile per flavor, so the hw_grpc_proto STATIC fix and the SHARED_LIBS / EXTRA_CMAKE_ARGS make-vars are inherited; the target just passes SHARED_LIBS=ON, the DL flags and --target ggml through, then collects the .so set. run.sh and package.sh updated to ship/select turboquant-cpu-all.
- Makefile lib-collection find now also matches *.dylib (for the darwin build, which emits dylibs rather than .so).

ik-llama-cpp is intentionally left unchanged: its pinned ggml has no CPU_ALL_VARIANTS support and its IQK kernels require AVX2, so the per-microarch dynamic backend set does not apply. Scope still excludes the darwin packaging wiring (separate change).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

…u-all packaging

- arm64: ggml CPU_ALL_VARIANTS builds armv9.2 SME variants whose -march=...+sme is rejected by the Ubuntu 24.04 default gcc-13. Build the arm64 variants with gcc-14 (installed in the compile step). The host only selects a variant it actually supports at runtime, but every variant must still compile.
- darwin: scripts/build/llama-cpp-darwin.sh builds llama-cpp-cpu-all instead of the fallback binary, keeps Metal (GGML_METAL stays ON; --target ggml also builds ggml-metal). The per-microarch libggml-cpu-*.dylib are placed in the package root next to the binary (darwin has no bundled ld.so, so ggml's executable-dir scan looks there), while the other shared dylibs go in lib/ for DYLD_LIBRARY_PATH.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

….dylib lib)

ggml emits its loadable backends (per-microarch CPU variants, metal, blas) with a .so suffix even on darwin, while the core libraries (ggml-base/ggml/llama/llama-common/mtmd) use .dylib. Split the distribution by suffix: .so DL backends go in the package root for ggml's executable-directory scan, .dylib core libs go in lib/ for DYLD_LIBRARY_PATH. The previous .dylib name-pattern matched none of the variants.

Verified on an M4: ggml loads the apple_m4 CPU variant (SME=1) and Metal, model loads and generates correct tokens.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]

mudler added 4 commits June 24, 2026 21:21

localai-bot changed the title ~~feat(llama-cpp): single x86 CPU build via ggml CPU_ALL_VARIANTS~~ feat: single-build ggml CPU_ALL_VARIANTS for llama-cpp + turboquant (x86/arm64/apple) Jun 24, 2026

localai-bot wants to merge 4 commits into master from feat/llama-cpp-cpu-all-variants
