feat: single-build ggml CPU_ALL_VARIANTS for llama-cpp + turboquant (x86/arm64/apple) #10497
Open
localai-bot wants to merge 4 commits into
Replace the per-microarch avx/avx2/avx512/fallback multi-binary build on
x86 with a single grpc-server plus the dlopen-able libggml-cpu-*.so set
that ggml's backend registry selects at runtime by probing host CPU
features. One build instead of four, broader microarch coverage (adds
alderlake AVX-VNNI, zen4 AVX512-BF16, sapphirerapids AMX), and the
shell-side /proc/cpuinfo probing in run.sh goes away.
Build/link notes:
- CPU_ALL_VARIANTS requires GGML_BACKEND_DL + BUILD_SHARED_LIBS=ON, so
ggml/llama become shared objects. SHARED_LIBS is now a make variable
(default OFF) so the override survives the recursive sub-make into the
VARIANT build dir instead of being re-clobbered by the base flags.
- The cpu-all target also builds "--target ggml": the per-microarch
backends are runtime-dlopened, not link deps, so they only compile via
ggml's add_dependencies().
- hw_grpc_proto is pinned STATIC. Under BUILD_SHARED_LIBS=ON it would
otherwise become a DSO referencing hidden-visibility symbols in the
static libprotobuf.a, which fails to link ("hidden symbol ... is
referenced by DSO"). Keeping it static links gRPC/protobuf into the
executable while only ggml/llama stay shared, so no PIC or base-image
change is required.
- package.sh bundles the libggml-*.so set into package/lib; ggml finds
them by scanning the bundled ld.so directory (/proc/self/exe), which
run.sh launches from.
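The flag relationships in the notes above can be sketched as a small shell helper. This is a minimal sketch under stated assumptions, not the Makefile from this change: the function names variant_cmake_args and variant_build_targets are hypothetical, and only the flag names and the extra ggml target come from the notes.

```shell
#!/bin/sh
# Sketch only. Function names are hypothetical; the flag names and the
# extra "ggml" build target are the ones named in the build/link notes.

# Print the CMake cache flags for a build flavor, one per line.
# cpu-all needs shared libs plus the dynamic-backend flags; every other
# flavor keeps the default BUILD_SHARED_LIBS=OFF.
variant_cmake_args() {
    case "$1" in
        cpu-all)
            printf '%s\n' \
                "-DBUILD_SHARED_LIBS=ON" \
                "-DGGML_BACKEND_DL=ON" \
                "-DGGML_CPU_ALL_VARIANTS=ON"
            ;;
        *)
            printf '%s\n' "-DBUILD_SHARED_LIBS=OFF"
            ;;
    esac
}

# Print the build targets for a flavor, one per line.
# cpu-all also builds "ggml" so the dlopen-ed backends get compiled,
# since nothing links against them directly.
variant_build_targets() {
    case "$1" in
        cpu-all) printf '%s\n' grpc-server ggml ;;
        *)       printf '%s\n' grpc-server ;;
    esac
}
```

One way to use it, assuming a CMake new enough to accept several names after --target: cmake -S . -B build $(variant_cmake_args cpu-all), then cmake --build build --target $(variant_build_targets cpu-all).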
Scope: x86 only. arm64/darwin keep the single fallback build. The
ik-llama-cpp / turboquant forks and the other ggml C++ backends are
unchanged; the same recipe applies but is out of scope here.
Validated with a full docker build plus a live inference smoke test:
the model loads, ggml selects the AVX512_BF16 variant on a Zen-class
host, and tokens generate correctly.
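As a cheap pre-check before a smoke test like this one, the bundled variant set can be listed from the package lib directory. The helper below is a hypothetical sketch; it assumes only the libggml-cpu-<variant>.so naming and the package/lib location mentioned in the notes above.

```shell
#!/bin/sh
# Sketch only: list_cpu_variants is a hypothetical helper name.

# Print the CPU variant names found in a directory, one per line, sorted.
# A file named libggml-cpu-zen4.so is reported as "zen4".
list_cpu_variants() {
    dir="$1"
    for f in "$dir"/libggml-cpu-*.so; do
        [ -e "$f" ] || continue          # the glob matched nothing
        name="${f##*/libggml-cpu-}"      # drop the path and the prefix
        printf '%s\n' "${name%.so}"      # drop the suffix
    done | sort
}
```

For example, list_cpu_variants package/lib prints nothing when no variant was bundled, which is the case worth catching before starting the server.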
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
…uant
- llama-cpp: x86 AND arm64 now use the single llama-cpp-cpu-all build
(only hipblas keeps the fallback build). ggml's arm64 variant table
(armv8.x / armv9.x, plus apple_m* on darwin) is selected at runtime.
- turboquant: same recipe via a turboquant-cpu-all target. turboquant
copies backend/cpp/llama-cpp's CMakeLists.txt + Makefile per flavor, so
the hw_grpc_proto STATIC fix and the SHARED_LIBS / EXTRA_CMAKE_ARGS
make-vars are inherited; the target just passes SHARED_LIBS=ON, the DL
flags and --target ggml through, then collects the .so set. run.sh and
package.sh updated to ship/select turboquant-cpu-all.
- Makefile lib-collection find now also matches *.dylib (for the darwin
build, which emits dylibs rather than .so).
ik-llama-cpp is intentionally left unchanged: its pinned ggml has no
CPU_ALL_VARIANTS support and its IQK kernels require AVX2, so the
per-microarch dynamic backend set does not apply.
Scope still excludes the darwin packaging wiring (separate change).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
…u-all packaging
- arm64: ggml CPU_ALL_VARIANTS builds armv9.2 SME variants whose
-march=...+sme is rejected by the Ubuntu 24.04 default gcc-13. Build the
arm64 variants with gcc-14 (installed in the compile step). The host
only selects a variant it actually supports at runtime, but every
variant must still compile.
- darwin: scripts/build/llama-cpp-darwin.sh builds llama-cpp-cpu-all
instead of the fallback binary, keeps Metal (GGML_METAL stays ON;
--target ggml also builds ggml-metal). The per-microarch
libggml-cpu-*.dylib are placed in the package root next to the binary
(darwin has no bundled ld.so, so ggml's executable-dir scan looks
there), while the other shared dylibs go in lib/ for DYLD_LIBRARY_PATH.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
….dylib lib)
ggml emits its loadable backends (per-microarch CPU variants, metal,
blas) with a .so suffix even on darwin, while the core libraries
(ggml-base/ggml/llama/llama-common/mtmd) use .dylib. Split the
distribution by suffix: .so DL backends go in the package root for
ggml's executable-directory scan, .dylib core libs go in lib/ for
DYLD_LIBRARY_PATH. The previous .dylib name-pattern matched none of the
variants.
Verified on an M4: ggml loads the apple_m4 CPU variant (SME=1) and
Metal, model loads and generates correct tokens.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
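The split by suffix described in this commit can be sketched as a small shell function. This is an illustration under stated assumptions, not the code in scripts/build/llama-cpp-darwin.sh: the function name split_darwin_libs and its two directory arguments are hypothetical.

```shell
#!/bin/sh
# Sketch only: split_darwin_libs and its arguments are hypothetical.
#   $1 = directory holding the freshly built libraries
#   $2 = package root, i.e. the directory that holds the binary

split_darwin_libs() {
    src="$1"
    pkg="$2"
    mkdir -p "$pkg/lib"
    for f in "$src"/*.so "$src"/*.dylib; do
        [ -e "$f" ] || continue   # skip a glob that matched nothing
        case "$f" in
            # Loadable backends: next to the binary, where ggml's
            # executable-directory scan looks for them.
            *.so)    cp -a "$f" "$pkg/" ;;
            # Core libraries: lib/, resolved through DYLD_LIBRARY_PATH.
            *.dylib) cp -a "$f" "$pkg/lib/" ;;
        esac
    done
}
```

Matching on the suffix rather than on a name pattern is the point of the design: a pattern that expects libggml-cpu-*.dylib finds nothing, because the variants carry a .so suffix.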
What
Replaces the per-microarch multi-binary C++ backend builds with a single grpc-server plus the set of dlopen-able libggml-cpu-* libraries produced by ggml's GGML_CPU_ALL_VARIANTS. ggml's backend registry probes the host CPU at runtime and loads the best variant, so the shell-side /proc/cpuinfo probing in run.sh is gone.
Applies to llama-cpp (x86 + arm64 + darwin/apple) and turboquant (x86 + arm64).
ik-llama-cpp is intentionally excluded (see below).
Why
How it links (the non-obvious parts)
CPU_ALL_VARIANTS → GGML_BACKEND_DL → BUILD_SHARED_LIBS=ON, so ggml/llama become shared objects. Three things make that coexist with the static-gRPC grpc-server:
- SHARED_LIBS is a make variable (default OFF). An appended -DBUILD_SHARED_LIBS=ON gets re-clobbered by the recursive sub-make into the VARIANT build dir; a command-line make variable propagates and wins.
- --target ggml is added: the per-microarch backends are runtime-dlopened, not link deps, so they only build via ggml's add_dependencies().
- hw_grpc_proto is pinned STATIC: under BUILD_SHARED_LIBS=ON it would become a DSO referencing the hidden-visibility symbols in static libprotobuf.a (ld: hidden symbol ... referenced by DSO). Keeping it static links gRPC/protobuf into the executable while only ggml/llama go shared. No PIC change, no base-grpc-* rebuild required.
turboquant copies llama-cpp's CMakeLists.txt + Makefile per flavor, so it inherits all three for free.
Per-platform specifics
- arm64: armv9.2 SME variants use -march=...+sme, rejected by Ubuntu 24.04's default gcc-13. The arm64 compile step installs and uses gcc-14. (For CI's prebuilt base-grpc-arm64, gcc-14 should move into the base; that is a follow-up.)
- darwin: there is no bundled ld.so, so ggml scans the binary's own directory. ggml emits its loadable backends (CPU variants, metal, blas) with a .so suffix even on macOS, while core libs are .dylib, so .so files go in the package root (for the scan) and .dylib files in lib/ (for DYLD_LIBRARY_PATH). Metal is preserved (GGML_METAL stays ON; --target ggml builds ggml-metal).
ik-llama-cpp: excluded
Its pinned ggml has zero CPU_ALL_VARIANTS support and its IQK kernels require AVX2, so the per-microarch dynamic backend set does not apply. Left unchanged (avx2/fallback).
Validation (all live inference, not just builds)
Each shipped image was inspected to confirm the variant set is present and cpu-all is NEEDED-linked against shared ggml with gRPC static; the old avx/avx2/avx512 binaries are gone.
🤖 Generated with Claude Code