[Infra] Streamline Dockerfile.non_root build time #26055
Draft
yuneng-berri wants to merge 6 commits into litellm_internal_staging from …
Adds a CI job that rebuilds the admin UI from source and fails if the committed static export at litellm/proxy/_experimental/out/ has drifted from what npm run build produces. This prevents silently shipping stale UI bytes and is a prerequisite for the non_root Dockerfile streamlining work, which will stage the UI from _experimental/out/ directly instead of rebuilding it inside the image.

Also regenerates litellm/proxy/_experimental/out/ to match a fresh npm run build (Node 20.20.2); the committed tree had drifted from source prior to this commit.

Co-authored-by: yuneng-jiang <yuneng-berri@users.noreply.github.com>
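To make the drift check concrete, here is a minimal sketch of one way a "has the committed export drifted from a fresh build" comparison could work: hash every file in both trees and report paths that are missing, extra, or different. This is not the workflow's actual implementation; the function names and the demo file names are hypothetical, and the sketch makes no attempt to normalize values that legitimately vary between builds.

```python
import hashlib
import tempfile
from pathlib import Path


def tree_digest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its bytes."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def find_drift(committed: Path, rebuilt: Path) -> list[str]:
    """Return sorted relative paths that are missing, extra, or differ in content."""
    old, new = tree_digest(committed), tree_digest(rebuilt)
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))


def demo() -> list[str]:
    """Build two throwaway trees (hypothetical file names) and report the drift."""
    with tempfile.TemporaryDirectory() as tmp:
        committed, rebuilt = Path(tmp, "committed"), Path(tmp, "rebuilt")
        for tree in (committed, rebuilt):
            tree.mkdir()
            (tree / "index.html").write_text("<html>same</html>")
        # Only app.js differs between the two trees.
        (committed / "app.js").write_text("console.log('old')")
        (rebuilt / "app.js").write_text("console.log('new')")
        return find_drift(committed, rebuilt)
```

A CI step built on this idea would exit non-zero whenever `find_drift` returns a non-empty list, for example after running the UI build into a scratch directory and comparing it against the committed export.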
The checked-in Next.js static export at litellm/proxy/_experimental/out/ is kept fresh by the UI Drift Guard CI workflow. Stage it directly instead of re-running npm ci + npm run build inside the image. This removes: nvm install, node 20.20.2 install, npm ci (801 pkgs), next build, and the resulting intermediate node_modules/out tree.

Build time: ~6m25s -> ~2m (fuse-overlayfs DinD); image 6.57GB -> 5.0GB.

Behavior parity verified: API endpoints, UI screenshots (all 10 routes pixel-perfect), and Trivy HIGH/CRITICAL CVE count (6 -> 5, one npm GHSA removed) all match or improve over baseline.

Co-authored-by: yuneng-jiang <yuneng-berri@users.noreply.github.com>
npm was installed in the runtime only to globally install vulnerability-patched versions of tar/glob/brace-expansion/minimatch/diff and to rewrite npm's own bundled package.json in place. Both existed to silence CVE scanners against modules that ship with npm itself. Since we no longer run npm anywhere in the runtime (Prisma uses the node binary directly for migrate deploy and generate), we can skip installing npm in the first place. This eliminates both the ~25-line CVE-patch shuffle and the underlying CVE surface.

Kept: nodejs (needed by prisma-python's CLI and migrate deploy).
Removed: npm apk package, all 'npm install -g', all find+sed patching, the redundant 'apk upgrade --no-cache nodejs' (already covered by the preceding 'apk upgrade').

Image: 4.97GB (opt-1) -> 4.97GB (opt-2); the real win is that two CVEs (CVE-2026-33671 and GHSA-q4gf-8mx6-v5v3) drop off the Trivy HIGH/CRITICAL list. No new CVEs introduced. API parity and UI visual regression both match baseline.

Co-authored-by: yuneng-jiang <yuneng-berri@users.noreply.github.com>
After Task 2.1 removed the in-image Next.js build, the builder stage no longer needs a full C/C++ + Clang toolchain. Keep gcc + python3-dev (required to compile ml-dtypes 0.4.1 from source; no wheel is published for Python 3.13 yet). Drop everything else.

Removed from apk: clang, llvm, lld, linux-headers, build-base, openssl-dev, npm.
Removed NVM_DIR env and /root/.nvm from PATH (no nvm-based Node install anymore).
Kept: python3, python3-dev, gcc, bash, coreutils, curl, openssl, libsndfile, nodejs. gcc (15.2) serves both C and C++; the separate g++ package doesn't exist in Wolfi.

Image size unchanged (the builder stage doesn't end up in the runtime); cold builds are slightly slower due to the ml-dtypes source compile, but that will be recovered in the next task via a BuildKit uv cache mount. API parity and UI visual regression both match baseline. Trivy HIGH/CRITICAL CVE count unchanged from opt-2 (4 CVEs, none new).

Co-authored-by: yuneng-jiang <yuneng-berri@users.noreply.github.com>
Mount /app/.cache/uv as a BuildKit type=cache on both 'uv sync' steps. The cache persists across builds on the same builder (and, when used with type=gha in CI, across CI runs) so repeat builds don't re-download every wheel.

Side effect: because the cache lives outside the image layer, the ~742MB of downloaded wheel archives that were previously baked into /app/.cache/uv drop out of the final image. Compressed image size goes from ~5.0GB to ~3.7GB, and the 'USER nobody' prisma-generate layer is 1.7GB vs 2.4GB.

Warm-build timing: a uv-sync-invalidating edit now takes ~1m30s vs ~2m39s without the cache mount, on this dev VM.

API parity and UI visual regression continue to match baseline. Trivy HIGH/CRITICAL: 6 at baseline -> 2 now, no new CVEs.

Co-authored-by: yuneng-jiang <yuneng-berri@users.noreply.github.com>
Five small, individually-verified cleanups collected into one commit:

- Drop 'prisma migrate diff --from-empty ... > /dev/null 2>&1 || true' from the builder. Stdout/stderr/exit-status all discarded; nothing reads the output. Dead line.
- Drop 'mkdir -p /app/.cache/npm' from the same RUN. npm is gone.
- Drop the runtime's redundant 'sed -i' + 'chmod +x' on the entrypoint scripts. The builder already does the same three lines, and the runtime copies /app from the builder via COPY --from=builder, so the normalized files (and exec bits, which buildkit preserves) are already in place.
- Drop NPM_CONFIG_CACHE and NPM_CONFIG_PREFER_OFFLINE from the runtime ENV. Nothing reads them after Task 2.2 removed npm.
- Drop '/.npm' and '/tmp/.npm' from the runtime's mkdir + chown. These directories only existed as npm's writable dirs for the non-root user; npm is gone.

.dockerignore: add 'ui/'. After Task 2.1 the non_root image sources its UI bytes from litellm/proxy/_experimental/out/, so the whole ui/litellm-dashboard/ source tree is dead weight when the blanket 'COPY . .' pulls it into /app. Verified (with ripgrep) that no Python code under litellm/ opens any file under ui/. All string references to 'ui/...' are URL paths, not filesystem paths.

Final image size: 6.57GB baseline -> 1.96GB. API parity and UI visual regression match baseline across all 12 API scenarios and 10 UI routes. Trivy HIGH/CRITICAL: 6 -> 2, no new CVEs introduced.

Co-authored-by: yuneng-jiang <yuneng-berri@users.noreply.github.com>
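For a sense of scale, the headline numbers in this commit message can be turned into relative reductions. The sketch below is only arithmetic on the figures quoted above (6.57GB -> 1.96GB, Trivy 6 -> 2); the helper name is hypothetical.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent, rounded to one decimal."""
    return round((before - after) / before * 100, 1)


# Figures quoted in the commit message above.
image_size_reduction = percent_reduction(6.57, 1.96)  # image size, GB
trivy_reduction = percent_reduction(6, 2)             # HIGH/CRITICAL CVE count

print(f"image size: -{image_size_reduction}%")
print(f"Trivy HIGH/CRITICAL: -{trivy_reduction}%")
```

In other words, the final image is roughly 70% smaller than the baseline, and the HIGH/CRITICAL finding count drops by about two thirds.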
Summary
Streamline docker/Dockerfile.non_root. The non-root image had accumulated redundant work over time, making CI builds slow (notably in test_server_root_path.yml, which rebuilds it on every PR). Each commit in this PR is a single targeted optimization, verified as behavior-preserving against the prior baseline via static package/file diff, API-endpoint parity, UI visual regression, and CVE scan.

Also adds a UI Drift Guard CI workflow (first commit) so the streamlined image can safely stage its UI from the checked-in Next.js static export without ever shipping stale UI bytes.
Changes (one per commit)
1. Add a UI Drift Guard CI job that fails if litellm/proxy/_experimental/out/ has drifted. Also regenerates the committed export to match current source (it had drifted prior to this PR).
2. Stage the UI from litellm/proxy/_experimental/out/ instead of running npm ci + npm run build inside the image. Removes the nvm-based Node bootstrap, npm install -g, and the full UI build step.
3. Remove npm from the runtime. It was installed only to globally install patched versions of tar/glob/brace-expansion/minimatch/diff and to rewrite npm's own package.json. Dropping npm eliminates both the ~25-line patch shuffle and the underlying CVE surface. nodejs is kept (Prisma needs it).
4. Drop clang, llvm, lld, linux-headers, build-base, openssl-dev and the now-orphan NVM_DIR env / /root/.nvm PATH prefix. Keep gcc + python3-dev (minimum needed for ml-dtypes, which has no py3.13 wheel yet).
5. Mount /app/.cache/uv as type=cache on both uv sync calls. Wheel archives move out of the image layer (~640 MB shaved), and repeat CI builds don't re-download them.
6. Drop the prisma migrate diff dead line, the runtime's duplicated sed/chmod on the entrypoint scripts, and the NPM_CONFIG_* ENVs and /.npm + /tmp/.npm dirs that became dead after step 3. Add ui/ to .dockerignore (after step 2 the UI source tree is never read during build).

Results
… opt-5)

Testing
Every optimization was verified against the prior baseline with a disposable harness covering:
… nodejs, npm, brotli*, icu78, libuv); UI file hashes differ only in the per-build Next.js build ID (content identical, confirmed by pixel-perfect UI regression); /app tree identical except for the deliberately-excluded ui/… and orphan /.npm paths; ownership/perms on /app, /app/.venv, /var/lib/litellm/ui, /var/lib/litellm/assets all match baseline.

- GET /health/liveliness, GET /health/readiness
- GET /v1/models, GET /model/info
- POST /v1/chat/completions (via WireMock upstream)
- POST /v1/embeddings (via WireMock upstream)
- POST /key/generate, GET /key/info, POST /key/delete
- POST /user/new → chain GET /user/info?user_id=<uid> → POST /user/delete

All scenarios match status + canonicalized body at every step.
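As an illustration of what "match status + canonicalized body" can mean in practice, the sketch below compares two (status, JSON body) pairs after masking fields that legitimately change between runs and re-serializing with sorted keys. It is not the harness used for this PR: the `VOLATILE_KEYS` set and all function names are assumptions chosen for the example, and non-JSON bodies are not handled.

```python
import json
from typing import Any

# Hypothetical set of fields expected to differ between two otherwise equal runs.
VOLATILE_KEYS = frozenset({"id", "created", "created_at", "updated_at", "token", "key"})


def canonicalize(value: Any) -> Any:
    """Recursively replace volatile fields with a fixed placeholder."""
    if isinstance(value, dict):
        return {
            k: "<volatile>" if k in VOLATILE_KEYS else canonicalize(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [canonicalize(v) for v in value]
    return value


def canonical_body(raw: str) -> str:
    """Parse a JSON body and re-serialize it with sorted keys and fixed separators."""
    return json.dumps(canonicalize(json.loads(raw)), sort_keys=True, separators=(",", ":"))


def responses_match(baseline: tuple[int, str], candidate: tuple[int, str]) -> bool:
    """Two (status, body) pairs match when status and canonical body are both equal."""
    return (
        baseline[0] == candidate[0]
        and canonical_body(baseline[1]) == canonical_body(candidate[1])
    )
```

For a chained scenario such as create, read, delete, the same comparison would be applied to each step's response in turn, so a mismatch is attributed to the step where it first appears.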
… test_server_root_path.yml workflow on this PR.

Type
🚄 Infrastructure
🧹 Refactoring