feat: multi-arch CUDA Dockerfile and sm_121 (DGX Spark GB10)#840
nazq wants to merge 1 commit into huggingface:main
Conversation
Force-pushed from 44f1190 to 8cf4772
alvarobartt
left a comment
Thanks a lot for the PR @nazq, looks really clean!
Could you also review and update the table with the different images at https://github.com/huggingface/text-embeddings-inference/blob/main/docs/source/en/supported_models.md? Then I'll merge and validate that the CI is working as expected, hoping to release v1.9.3 next week.
And thanks for building on top of @z4y4ts PR and keeping them as co-author, much appreciated 🤗
Updated supported_models.md. I updated the CI too, but I haven't run it, so that part was done by inspection.
Hi @nazq, thanks so much for this PR! I tested it on my Spark and got a build failure. After searching a bit, I found that PR #842 should fix it. I applied those changes and the build finished without any errors, so I guess only a rebase is needed.
Great, thanks for this! I didn't buy a Spark until I knew we could get this PR in. Happy to rebase it.
- Add Dockerfile-cuda supporting both x86_64 and ARM64 (aarch64)
- Add sm_121 compute capability for NVIDIA GB10 (DGX Spark)
- Add cpu-arm64 image variant
- Update supported hardware documentation

Co-Authored-By: z4y4ts <z4y4ts@users.noreply.github.com>
Force-pushed from a9395f8 to ad55ed2
Hey @stefan-it — rebased onto upstream main, which now includes #842. Should fix the
Hi @nazq, many thanks! I did a fresh clone of the rebased branch and built it with:

docker build . -f Dockerfile-cuda --no-cache --build-arg CUDA_COMPUTE_CAP=121 --platform linux/arm64 -t text-embeddings-inference:121-1.9-pr

The result was:

[+] Building 895.2s (32/32) FINISHED docker:default
=> [internal] load build definition from Dockerfile-cuda 0.0s
=> => transferring dockerfile: 6.46kB 0.0s
=> [internal] load metadata for docker.io/nvidia/cuda:12.9.1-runtime-ubuntu24.04 0.2s
=> [internal] load metadata for docker.io/nvidia/cuda:12.9.1-devel-ubuntu24.04 0.2s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 53B 0.0s
=> [internal] load build context 0.0s
=> => transferring context: 17.28kB 0.0s
=> CACHED [base-builder 1/6] FROM docker.io/nvidia/cuda:12.9.1-devel-ubuntu24.04@sha256:020bc241a628776338f4d4053fed4c38f6f7f3d7eb5919fecb8de313bb8ba47c 0.0s
=> CACHED [base 1/3] FROM docker.io/nvidia/cuda:12.9.1-runtime-ubuntu24.04@sha256:1287141d283b8f06f45681b56a48a85791398c615888b1f96bfb9fc981392d98 0.0s
=> [base-builder 2/6] RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends curl libssl-dev pkg-config && rm -rf /var/l 22.1s
=> [base 2/3] RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends ca-certificates libssl-dev curl cuda-compat-12-9 19.6s
=> [base 3/3] COPY --chmod=775 cuda-entrypoint.sh entrypoint.sh 0.0s
=> [base-builder 3/6] RUN case "arm64" in "amd64") SCCACHE_ARCH=x86_64-unknown-linux-musl ;; "arm64") SCCACHE_ARCH=aarch64-unknown-linux-musl ;; *) echo "Unsupported 2.9s
=> [base-builder 4/6] COPY rust-toolchain.toml rust-toolchain.toml 0.0s
=> [base-builder 5/6] RUN curl https://sh.rustup.rs -sSf | bash -s -- -y 32.5s
=> [base-builder 6/6] RUN cargo install cargo-chef --version 0.1.73 --locked 49.9s
=> [planner 1/7] WORKDIR /usr/src 0.0s
=> [builder 2/9] RUN --mount=type=secret,id=actions_results_url,env=ACTIONS_RESULTS_URL --mount=type=secret,id=actions_runtime_token,env=ACTIONS_RUNTIME_TOKEN if [ 121 -g 0.6s
=> [planner 2/7] COPY backends backends 0.1s
=> [planner 3/7] COPY core core 0.1s
=> [planner 4/7] COPY router router 0.1s
=> [planner 5/7] COPY Cargo.toml ./ 0.1s
=> [planner 6/7] COPY Cargo.lock ./ 0.1s
=> [planner 7/7] RUN cargo chef prepare --recipe-path recipe.json 0.2s
=> [builder 3/9] COPY --from=planner /usr/src/recipe.json recipe.json 0.1s
=> [builder 4/9] RUN --mount=type=secret,id=actions_results_url,env=ACTIONS_RESULTS_URL --mount=type=secret,id=actions_runtime_token,env=ACTIONS_RUNTIME_TOKEN if [ 121 360.7s
=> [builder 5/9] COPY backends backends 0.1s
=> [builder 6/9] COPY core core 0.1s
=> [builder 7/9] COPY router router 0.1s
=> [builder 8/9] COPY Cargo.toml ./ 0.1s
=> [builder 9/9] COPY Cargo.lock ./ 0.1s
=> [http-builder 1/1] RUN --mount=type=secret,id=actions_results_url,env=ACTIONS_RESULTS_URL --mount=type=secret,id=actions_runtime_token,env=ACTIONS_RUNTIME_TOKEN if [ 423.1s
=> [stage-7 1/1] COPY --from=http-builder /usr/src/target/release/text-embeddings-router /usr/local/bin/text-embeddings-router 0.4s
=> exporting to image 1.5s
=> => exporting layers 1.4s
=> => writing image sha256:2018875deaebfac387abad481f0f2bb7979853ad2b607297aa8bdba5b1d67ef4 0.0s
=> => naming to docker.io/library/text-embeddings-inference:121-1.9-pr

So definitely working on a Spark 🥳
I'll put my order in then ;-)
Summary
Builds on #827 (ARM64 CPU Dockerfile) by extending CUDA support to ARM64 and adding the DGX Spark GB10's sm_121 compute capability. Also adds the CI matrix entries and README updates needed to ship ARM64 images.
Changes
Dockerfile-cuda (multi-arch)
- Use `TARGETARCH` to select the correct sccache binary (x86_64 or aarch64)
- Use `TARGETARCH` to select the correct protoc binary (x86_64 or aarch_64)
- `nvprune` section for DGX Spark GB10

compute_cap.rs
- `(120..=121, 120) => true` — sm_121 runtime is compatible with sm_120 compiled binaries
- `(121, 121) => true` — exact match for native sm_121 builds

flash_attn.rs
- Allow `runtime_compute_cap == 121` to use flash attention v2 (same arch family as sm_120)

build.yaml
- `matrix.platforms` with fallback to `linux/amd64` — enables per-variant platform selection without breaking existing entries

matrix.json
- `blackwell-121` entry (`linux/arm64`, `CUDA_COMPUTE_CAP=121`) for DGX Spark GB10
- `cpu-arm64` entry (`linux/arm64`, `Dockerfile-arm64`) for ARM64 CPU-only hosts

README.md
- `Platform` column added to the Docker Images table
- `cpu-arm64-1.9` and `121-1.9` image entries

Motivation
The NVIDIA DGX Spark uses the GB10 SoC with compute capability 12.1 (sm_121). This is a Blackwell-family chip (Grace + Blackwell GPU) on ARM64. Without these changes, TEI cannot run on the DGX Spark with CUDA acceleration.
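As a rough illustration of the `compute_cap.rs` and `flash_attn.rs` changes listed above, the sm_121 handling might look like the following sketch. Only the two new match arms come from this PR; the function signatures and the elided arms for older architectures are assumptions for illustration, not the repository's actual code.

```rust
// Sketch of the sm_121 handling described under "Changes". Only the arms
// added by this PR are taken from the description; signatures and the
// elided arms for other architectures are assumptions.
fn compute_cap_matching(runtime_compute_cap: usize, compile_compute_cap: usize) -> bool {
    match (runtime_compute_cap, compile_compute_cap) {
        // sm_121 runtime (DGX Spark GB10) can run sm_120 compiled binaries
        (120..=121, 120) => true,
        // exact match for native sm_121 builds
        (121, 121) => true,
        // ... existing arms for other architectures elided ...
        (_, _) => false,
    }
}

// flash_attn.rs: sm_121 shares the Blackwell architecture family with
// sm_120, so both are allowed to use flash attention v2 in this sketch.
fn supports_flash_attn_v2(runtime_compute_cap: usize) -> bool {
    // other supported compute caps elided for brevity
    matches!(runtime_compute_cap, 120 | 121)
}
```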
Testing
- `docker build -f Dockerfile-cuda --build-arg CUDA_COMPUTE_CAP=121 --platform linux/arm64 .`
- `compute_cap_matching` with sm_121
- `121-{version}-grpc` and `cpu-arm64-{version}-grpc` images

Closes #769
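For context on the multi-arch Dockerfile change, the `TARGETARCH`-based sccache selection visible in the build log above could be sketched roughly as follows. This is an illustrative fragment reconstructed from the log, not the exact contents of Dockerfile-cuda; the actual download step and surrounding stages are omitted.

```dockerfile
# Illustrative fragment reconstructed from the build log; Docker/BuildKit
# sets TARGETARCH (amd64/arm64) from the --platform flag.
ARG TARGETARCH
RUN case "$TARGETARCH" in \
        "amd64") SCCACHE_ARCH=x86_64-unknown-linux-musl ;; \
        "arm64") SCCACHE_ARCH=aarch64-unknown-linux-musl ;; \
        *) echo "Unsupported TARGETARCH: $TARGETARCH" && exit 1 ;; \
    esac \
    && echo "sccache target: $SCCACHE_ARCH"
```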