[Infra] Streamline Dockerfile.non_root build time#26055

Draft
yuneng-berri wants to merge 6 commits into litellm_internal_staging from litellm_non-root-dockerfile-optimization-31b6

@yuneng-berri
Collaborator

Summary

Streamline docker/Dockerfile.non_root — the non-root image had accumulated redundant work over time, making CI builds slow (notably in test_server_root_path.yml, which rebuilds it on every PR). Each commit in this PR is a single targeted optimization, verified behavior-preserving against the prior baseline via static package/file diff, API-endpoint parity, UI visual regression, and CVE scan.

Also adds a UI Drift Guard CI workflow (first commit) so the streamlined image can safely stage its UI from the checked-in Next.js static export without ever shipping stale UI bytes.

Changes (one per commit)

  1. [CI] UI drift guard — new workflow that rebuilds the admin UI from source on PRs and fails if the committed litellm/proxy/_experimental/out/ has drifted. Also regenerates the committed export to match current source (it had drifted prior to this PR).
  2. Stage pre-built UI — use the committed Next.js static export at litellm/proxy/_experimental/out/ instead of running npm ci + npm run build inside the image. Removes the nvm-based Node bootstrap, npm install -g, and the full UI build step.
  3. Remove unused npm from runtime — npm was present only to globally install CVE-patched versions of tar/glob/brace-expansion/minimatch/diff and to rewrite npm's own package.json. Dropping npm eliminates both the ~25-line patch shuffle and the underlying CVE surface. nodejs is kept (Prisma needs it).
  4. Slim C toolchain in builder — drop clang, llvm, lld, linux-headers, build-base, openssl-dev and the now-orphan NVM_DIR env / /root/.nvm PATH prefix. Keep gcc + python3-dev (minimum needed for ml-dtypes, which has no py3.13 wheel yet).
  5. BuildKit uv cache mount — mount /app/.cache/uv as type=cache on both uv sync calls. Wheel archives move out of the image layer (~640 MB shaved), and repeat CI builds don't re-download them.
  6. Cleanups — drop the silenced prisma migrate diff dead line, the runtime's duplicated sed/chmod on the entrypoint scripts, the NPM_CONFIG_* ENVs and /.npm+/tmp/.npm dirs that became dead after step 3. Add ui/ to .dockerignore (after step 2 the UI source tree is never read during build).
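The drift guard in step 1 comes down to comparing the committed export against a fresh build. The workflow itself is not shown here, so the following is only a minimal sketch of that comparison: the function names are hypothetical, and it assumes the build output is byte-reproducible (the Testing notes mention a per-build Next.js build ID, which a real guard would need to pin or normalize; this sketch does not).

```python
import hashlib
from pathlib import Path


def tree_digest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its bytes."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def drifted_paths(committed: Path, fresh: Path) -> list[str]:
    """Return paths that are missing, extra, or byte-different between two trees."""
    a, b = tree_digest(committed), tree_digest(fresh)
    return sorted(path for path in a.keys() | b.keys() if a.get(path) != b.get(path))
```

A CI job could call `drifted_paths(Path("litellm/proxy/_experimental/out"), fresh_dir)` and fail when the list is non-empty, where `fresh_dir` stands for wherever the fresh `npm run build` output lands (a location this PR does not specify).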

Results

| Metric | Baseline | Final (opt-5) |
| --- | --- | --- |
| Cold-build wall-clock (fuse-overlayfs DinD test VM) | 6m 25s | 1m 05s |
| Image size | 6.57 GB | 1.96 GB |
| Trivy HIGH/CRITICAL CVE count | 6 | 2 (no new) |
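
For scale, the relative improvements implied by the Results figures above can be worked out directly. This is plain arithmetic on the reported numbers, not an additional measurement; `reduction_pct` is a helper introduced only for this illustration.

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction going from `before` to `after`."""
    return 100.0 * (before - after) / before


cold_before_s = 6 * 60 + 25  # 6m 25s baseline cold build
cold_after_s = 1 * 60 + 5    # 1m 05s final cold build

print(f"build time: -{reduction_pct(cold_before_s, cold_after_s):.1f}% "
      f"({cold_before_s / cold_after_s:.1f}x faster)")          # -83.1% (5.9x faster)
print(f"image size: -{reduction_pct(6.57, 1.96):.1f}%")          # -70.2%
print(f"HIGH/CRITICAL CVEs: -{reduction_pct(6, 2):.1f}%")        # -66.7%
```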

Testing

Every optimization was verified against the prior baseline with a disposable harness covering:

  • Static diff — Python package set identical; OS package set differs only in expected removals (nodejs, npm, brotli*, icu78, libuv); UI file hashes differ only in the per-build Next.js build ID (content identical, confirmed by pixel-perfect UI regression); /app tree identical except for the deliberately-excluded ui/… and orphan /.npm paths; ownership/perms on /app, /app/.venv, /var/lib/litellm/ui, /var/lib/litellm/assets all match baseline.
  • API parity (12 scenarios, each run against a fresh Postgres on both sides, canonicalized-body diff):
    • GET /health/liveliness, GET /health/readiness
    • GET /v1/models, GET /model/info
    • POST /v1/chat/completions (via WireMock upstream)
    • POST /v1/embeddings (via WireMock upstream)
    • POST /key/generate, GET /key/info, POST /key/delete
    • POST /user/new → chain GET /user/info?user_id=<uid> → POST /user/delete
      All scenarios match status + canonicalized body at every step.
  • UI visual regression (Playwright + pixelmatch, 10 admin-UI routes at 1280×900, full page): pixel-perfect match at every step.
  • Trivy HIGH/CRITICAL CVE diff: no new CVEs introduced at any step; 4 CVEs dropped off between baseline and final.
  • Existing test_server_root_path.yml workflow on this PR.
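
The "canonicalized-body diff" used for API parity is not spelled out in this PR. One plausible reading is sketched below: fields expected to differ between two fresh runs are masked before the bodies are compared. The key names in `VOLATILE_KEYS` are hypothetical, as are the function names; the real harness may canonicalize differently.

```python
import json
from typing import Any

# Hypothetical field names: values expected to differ between two fresh runs.
VOLATILE_KEYS = frozenset({"id", "created", "created_at", "updated_at", "token", "key"})


def canonicalize(value: Any, volatile: frozenset[str] = VOLATILE_KEYS) -> Any:
    """Recursively mask volatile fields, keeping the fact that they are present."""
    if isinstance(value, dict):
        return {
            k: "<volatile>" if k in volatile else canonicalize(v, volatile)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [canonicalize(v, volatile) for v in value]
    return value


def same_response(a: tuple[int, str], b: tuple[int, str]) -> bool:
    """Compare (status, raw JSON body) pairs after canonicalization."""
    (status_a, body_a), (status_b, body_b) = a, b
    canon_a = json.dumps(canonicalize(json.loads(body_a)), sort_keys=True)
    canon_b = json.dumps(canonicalize(json.loads(body_b)), sort_keys=True)
    return status_a == status_b and canon_a == canon_b
```

Masking rather than deleting the volatile fields means a response that drops a field entirely still shows up as a mismatch.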

Type

🚄 Infrastructure
🧹 Refactoring

cursoragent and others added 6 commits April 19, 2026 04:03
Adds a CI job that rebuilds the admin UI from source and fails if the
committed static export at litellm/proxy/_experimental/out/ has drifted
from what npm run build produces. This prevents silently shipping stale
UI bytes and is a prerequisite for the non_root Dockerfile streamlining
work, which will stage the UI from _experimental/out/ directly instead
of rebuilding it inside the image.

Also regenerates litellm/proxy/_experimental/out/ to match a fresh
npm run build (Node 20.20.2) — the committed tree had drifted from
source prior to this commit.

Co-authored-by: yuneng-jiang <yuneng-berri@users.noreply.github.com>
The checked-in Next.js static export at litellm/proxy/_experimental/out/
is kept fresh by the UI Drift Guard CI workflow. Stage it directly
instead of re-running npm ci + npm run build inside the image.

This removes: nvm install, node 20.20.2 install, npm ci (801 pkgs),
next build, and the resulting intermediate node_modules/out tree.

Build time: ~6m25s -> ~2m (fuse-overlayfs DinD); image 6.57GB -> 5.0GB.
Behavior parity verified: API endpoints, UI screenshots (all 10 routes
pixel-perfect), and Trivy HIGH/CRITICAL CVE count (6 -> 5, one npm
GHSA removed) all match or improve over baseline.

Co-authored-by: yuneng-jiang <yuneng-berri@users.noreply.github.com>
npm was installed in the runtime only to globally install vulnerability
patched versions of tar/glob/brace-expansion/minimatch/diff and to
in-place rewrite npm's own bundled package.json. Both were to silence
CVE scanners against modules that ship with npm itself.

Since we no longer run npm anywhere in the runtime (Prisma uses the
node binary directly for migrate deploy and generate), we can just
skip installing npm in the first place. This eliminates both the
~25-line CVE-patch shuffle AND the underlying CVE surface.

Kept: nodejs (needed by prisma-python's CLI and migrate deploy).
Removed: npm apk package, all 'npm install -g', all find+sed patching,
the redundant 'apk upgrade --no-cache nodejs' (already covered by the
preceding 'apk upgrade').

Image: 4.97GB (opt-1) -> 4.97GB (opt-2); the real win is that two
CVEs (CVE-2026-33671 and GHSA-q4gf-8mx6-v5v3) drop off the Trivy
HIGH/CRITICAL list. No new CVEs introduced. API parity and UI
visual regression both match baseline.
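
The "no new CVEs" check can be read as a set difference over vulnerability IDs. The sketch below assumes Trivy's JSON report layout (`Results` → `Vulnerabilities` → `VulnerabilityID` / `Severity`); the function names are hypothetical and the harness actually used for this PR is not shown.

```python
SEVERITIES = {"HIGH", "CRITICAL"}


def high_critical_ids(report: dict) -> set[str]:
    """Collect HIGH/CRITICAL vulnerability IDs from a parsed Trivy JSON report."""
    ids: set[str] = set()
    for result in report.get("Results") or []:
        # "Vulnerabilities" may be absent or null for targets with no findings.
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in SEVERITIES:
                ids.add(vuln["VulnerabilityID"])
    return ids


def cve_diff(baseline: dict, candidate: dict) -> tuple[set[str], set[str]]:
    """Return (newly introduced IDs, IDs that dropped off)."""
    before, after = high_critical_ids(baseline), high_critical_ids(candidate)
    return after - before, before - after
```

With both reports loaded via `json.load`, a step passes when the first set returned by `cve_diff` is empty; the second set lists what dropped off.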

Co-authored-by: yuneng-jiang <yuneng-berri@users.noreply.github.com>
After Task 2.1 removed the in-image Next.js build, the builder stage no
longer needs a full C/C++ + Clang toolchain. Keep gcc + python3-dev
(required to compile ml-dtypes 0.4.1 from source — no wheel published
for Python 3.13 yet). Drop everything else.

Removed from apk: clang, llvm, lld, linux-headers, build-base,
openssl-dev, npm. Removed NVM_DIR env and /root/.nvm from PATH
(no nvm-based Node install anymore).

Kept: python3, python3-dev, gcc, bash, coreutils, curl, openssl,
libsndfile, nodejs. gcc (15.2) serves both C and C++; the separate
g++ package doesn't exist in Wolfi.

Image size unchanged (builder stage doesn't end up in the runtime);
cold builds slightly slower due to ml-dtypes source compile, but that
will be recovered in the next task via a BuildKit uv cache mount.
API parity and UI visual regression both match baseline, Trivy
HIGH/CRITICAL CVE count unchanged from opt-2 (4 CVEs, none new).

Co-authored-by: yuneng-jiang <yuneng-berri@users.noreply.github.com>
Mount /app/.cache/uv as a BuildKit type=cache on both 'uv sync' steps.
The cache persists across builds on the same builder (and, when used
with type=gha in CI, across CI runs) so repeat builds don't re-download
every wheel.

Side-effect: because the cache lives outside the image layer, the
~742MB of downloaded wheel archives that were previously baked into
/app/.cache/uv drop out of the final image. Compressed image size
goes from ~5.0GB to ~3.7GB, and the 'USER nobody' prisma-generate
layer is 1.7GB vs 2.4GB.

Warm-build timing: a uv-sync-invalidating edit now takes ~1m30s vs
~2m39s without the cache mount, on this dev VM.

API parity and UI visual regression continue to match baseline.
Trivy HIGH/CRITICAL: 6 at baseline -> 2 now, no new CVEs.

Co-authored-by: yuneng-jiang <yuneng-berri@users.noreply.github.com>
Five small, individually-verified cleanups collected into one commit:

- Drop 'prisma migrate diff --from-empty ... > /dev/null 2>&1 || true'
  from the builder. Stdout/stderr/exit-status all discarded; nothing
  reads the output. Dead line.
- Drop 'mkdir -p /app/.cache/npm' from the same RUN. npm is gone.
- Drop the runtime's redundant 'sed -i' + 'chmod +x' on the entrypoint
  scripts. The builder already does the same three lines, and the
  runtime copies /app from the builder via COPY --from=builder, so
  the normalized files (and exec bits, which buildkit preserves) are
  already in place.
- Drop NPM_CONFIG_CACHE and NPM_CONFIG_PREFER_OFFLINE from the runtime
  ENV — nothing reads them after Task 2.2 removed npm.
- Drop '/.npm' and '/tmp/.npm' from the runtime's mkdir + chown. These
  directories only existed as npm's writable dirs for the non-root
  user; npm is gone.

.dockerignore: add 'ui/'. After Task 2.1 the non_root image sources
its UI bytes from litellm/proxy/_experimental/out/, so the whole
ui/litellm-dashboard/ source tree is dead weight when the blanket
'COPY . .' pulls it into /app. Verified (with ripgrep) that no Python
code under litellm/ opens any file under ui/. All string references to
'ui/...' are URL paths, not filesystem paths.

Final image size: 6.57GB baseline -> 1.96GB. API parity and UI visual
regression match baseline across all 12 API scenarios and 10 UI
routes. Trivy HIGH/CRITICAL: 6 -> 2, no new CVEs introduced.

Co-authored-by: yuneng-jiang <yuneng-berri@users.noreply.github.com>