feat(deploy): add OpenShift/K8s deployment manifests#5880
feat(deploy): add OpenShift/K8s deployment manifests#5880singlerider merged 2 commits intozeroclaw-labs:masterfrom
Conversation
Sample manifests for deploying ZeroClaw on OpenShift/Kubernetes with an external LLM provider. Includes Namespace, Secret, ConfigMap, Deployment, Service, and Route with edge TLS. Highlights: - Distroless image, hardened SecurityContext (runAsNonRoot, drop ALL caps, read-only root filesystem, seccomp RuntimeDefault) - Health probes on /health endpoint - Pairing auth support via config - Sample files with .gitignore pattern so users can customize without leaking secrets Tested on OpenShift (NERC cluster): pod runs at ~6Mi memory idle with full agent runtime, web UI, and gateway. Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com> Signed-off-by: Pavel Anni <panni@redhat.com>
WareWolf-MoonWall
left a comment
There was a problem hiding this comment.
Welcome, @pavelanni — and thank you for a real, live-cluster-validated contribution.
The security posture here is genuinely strong, and the test evidence is the kind that
makes a review faster and a merge more confident. Let me walk through what you got
right before I get to the items that need to change.
Intake checklist
A few process items to address before this can merge. These are not optional — they are
how we keep review quality consistent across contributors and track work intentionally per
FND-003.
- Fill out the PR template. The description you wrote is clear and useful, but it
does not follow.github/pull_request_template.md.
For adeploy-k8s/docs PR most sections are short or N/A — that is fine, but the
reviewer needs to know each section was considered, not skipped. Specifically needed:
Blast radius, Security & Privacy Impact (Yes/No answers are required even when
all are No), Compatibility, Rollback, and i18n Follow-Through. See the
template for exactly what each section asks. - Set
risk:andsize:labels via the sidebar. The CI auto-assigneddocs
correctly, butrisk: lowandsize: Sare also required per the label registry.
This PR qualifies forrisk: low(new subtree, no runtime changes). - Assign to the v0.7.2 milestone. Per FND-003 §3.5, PRs are assigned to the
milestone they target before merge. This fits cleanly under v0.7.2 maintenance scope.
If you do not have milestone access, leave a comment here and a maintainer will set it.
✅ Commendations
The security context is complete — not just present.
runAsNonRoot, allowPrivilegeEscalation: false, readOnlyRootFilesystem: true,
capabilities.drop: ALL, seccompProfile.type: RuntimeDefault — this is the full
hardened profile. Most sample manifests in open-source projects include two or three of
these and omit the rest. Having all five here sets a baseline that users will copy
forward into their own cluster configs without thinking about it. Shipping secure defaults
in the starting point is how security posture propagates without requiring every
downstream user to know what to ask for.
The .gitignore pattern is elegant.
*.yaml + !*-sample.yaml means user copies are gitignored by default before they ever
have to remember to add them. The pattern is scoped to deploy-k8s/ so it does not
affect the rest of the repository. This is the kind of design that prevents secret leaks
from contributors who are moving fast and not thinking about what just got staged. The
convention is also self-documenting — anyone who reads the .gitignore understands the
copy-and-customize workflow immediately.
Secret injection is done correctly throughout.
The API key flows through a Kubernetes Secret and arrives in the container via
secretKeyRef, not hardcoded in the ConfigMap or as a literal env value. Using
stringData (not base64 data) in secret-sample.yaml is the right ergonomic choice
for a human-authored template — it is readable, diff-friendly, and the placeholder value
is unambiguous. This whole pattern is exactly what we want contributors to replicate.
Both probes are differentiated with real intent.
Readiness fires at 5s initial / 10s cycle; liveness fires at 10s initial / 60s cycle.
This is not cargo-culted from a tutorial — it reflects that the pod should stop receiving
traffic quickly when it is not ready, but should only be restarted under genuine
sustained failure. The distinction matters especially for an agent runtime that may be
mid-tool-execution when a transient health blip occurs.
The test evidence is the kind that makes a reviewer confident.
Pod running on NERC, route accessible, gateway healthy, pairing auth exercised, memory
footprint measured at ~6Mi idle. That is a complete deployment story, not a "manifests
look plausible" assertion. The ~6Mi figure is also genuinely useful signal for anyone
planning cluster resource allocation. Keep bringing this level of validation to future
PRs — it is exactly what the template's "Beyond CI" field is asking for.
🔴 Blocking
require_pairing = false ships an unauthenticated gateway as the default.
configmap-sample.yaml line 23 sets require_pairing = false. The README quick-start
takes the user from copy → edit API key → oc apply -f deploy-k8s/ in six steps. Anyone
who follows that path ends up with a ZeroClaw gateway publicly reachable on their
OpenShift Route with no authentication required.
The PR description notes "Pairing auth (require_pairing = true) verified working via
Route" — but the sample ships the opposite. There is a mismatch between what was tested
and what becomes the default for every user who copies this.
Change require_pairing = false to require_pairing = true in configmap-sample.yaml.
Per FND-006, secure defaults
belong in the starting point. Users who intentionally want a public endpoint (e.g.,
deployed behind a corporate SSO proxy that handles authentication at the network layer)
can change one line. Users who do not realize this setting exists should not have to
actively choose security — it should be the default they inherit.
A short comment above the line is welcome but not required:
# Set to false only if authentication is handled at the network/proxy layer.
require_pairing = true🟡 Conditional
api_key = "" in the config.toml block is a V1 legacy field that will emit a
migration warning on every pod startup.
I traced the credential resolution path in crates/zeroclaw-providers/src/lib.rs. The
API_KEY env var injected from the Secret does work — it is the documented generic
fallback at the end of the resolution chain (provider-specific env vars are checked
first, then ZEROCLAW_API_KEY, then API_KEY). There is no functional bug here.
However, api_key = "" in config.toml is a V1 legacy field. V1Compat::has_legacy_fields()
returns true when api_key.is_some(), including for an empty string, and the runtime
logs a migration notice on every startup as a result. Users will see a deprecation
warning that references a migration step they have no reason to take. That is confusing
and noisy for an otherwise clean deployment.
Remove the api_key = "" line from the config.toml block in configmap-sample.yaml.
The API_KEY env var is resolved at the env-fallback stage regardless of whether the
legacy field is present. The line can be deleted without any change in runtime behavior.
If you want to make the provider-specific convention more explicit, you could also rename
the env var from API_KEY to ANTHROPIC_API_KEY in deployment-sample.yaml (that is
the name ZeroClaw's own error messages reference for Anthropic), but API_KEY is valid —
this is your call. The deletion of the legacy config field is the required part.
ℹ️ Notes (no action required to merge)
State and workspace are ephemeral.
Both state and workspace volumes are emptyDir: {} — agent memory, session history,
and workspace files are lost on pod restart. For a sample this is the right choice (it
avoids PVC complexity that would obscure the core deployment pattern). Worth a one-liner
in the README's Configuration table or a note before the Cleanup section so users are
not surprised when a pod reschedule wipes their agent's context:
"Note: state and workspace use emptyDir — agent memory does not persist across pod
restarts. For production deployments, replace these volumes with PersistentVolumeClaims."
No Ingress path for vanilla Kubernetes.
The Route object is OpenShift-specific. The test plan mentions "Test on vanilla
Kubernetes (without Route)" but the README only shows oc commands. A single line — "On
vanilla Kubernetes, replace route-sample.yaml with a Kubernetes Ingress targeting port
42617" — would complete the story for non-OpenShift users without requiring a new file
this PR.
:latest image tag.
Adding a comment like # For production, replace :latest with a pinned version tag (e.g. v0.7.1)
aligns with the action-pinning discipline tracked in #5876
and signals good practice to readers who will copy this manifest forward.
Summary
Before this can merge:
- Fill out the PR template (Blast radius, Security & Privacy Impact, Compatibility,
Rollback, i18n Follow-Through) — most answers will be "No" or "N/A", that is fine. - Set
risk: lowandsize: Slabels via the sidebar. - Assign to the v0.7.2 milestone (or ask a maintainer to set it).
- 🔴 Change
require_pairing = false→require_pairing = trueinconfigmap-sample.yaml. - 🟡 Remove
api_key = ""from the config.toml block inconfigmap-sample.yaml.
Items 4 and 5 are both one-line changes in the same file. Once the template is filled and
the config defaults are corrected, this is ready to merge. The manifests are well-constructed,
the security hardening is right, and the cluster validation evidence is solid.
Address PR zeroclaw-labs#5880 review feedback: - Enable require_pairing by default in configmap-sample (was false) - Remove legacy api_key field that triggers V1 migration warnings - Add note about pinning :latest to a version tag - Document ephemeral state and vanilla K8s Ingress alternative Signed-off-by: Pavel Anni <panni@redhat.com> Assisted-By: Claude (Anthropic AI) <noreply@anthropic.com> Signed-off-by: Pavel Anni <panni@redhat.com>
|
I pushed the changes and filled out the PR template. But it looks like I don't have access to the labels and milestone settings. Please let me know if I missed something. |
WareWolf-MoonWall
left a comment
There was a problem hiding this comment.
PR Re-Review — #5880 feat(deploy): add OpenShift/K8s deployment manifests
Reviewing on commit 5ebc254. I read the full diff, the updated PR body, and the prior
review thread. Here is a precise accounting of what is resolved and what remains.
Resolved items ✅
🔴 Blocking — require_pairing = false (RESOLVED)
configmap-sample.yaml now has require_pairing = true. Verified directly in the
current diff. This was the most consequential item — it closes the unauthenticated
gateway default that the prior quick-start path would have shipped. ✅
🟡 Conditional — api_key = "" legacy field (RESOLVED)
The api_key = "" line is not present in the current configmap-sample.yaml diff.
The API_KEY env var from the Secret resolves correctly at the env-fallback stage
without it, and the migration warning on every pod startup is gone. ✅
ℹ️ Advisory notes — all three incorporated (RESOLVED)
All three informational notes from the prior review were picked up in the current
commits — no action was required, and each one was addressed anyway:
emptyDirpersistence caveat: README now has a "State is ephemeral" note under
Notes pointing users toward PersistentVolumeClaims for production. ✅- Vanilla Kubernetes path: README now has a "Vanilla Kubernetes" note directing
non-OpenShift users to replaceroute-sample.yamlwith a Kubernetes Ingress. ✅ :latestimage tag:deployment-sample.yamlnow carries the inline comment
# For production, pin to a version tag (e.g. ghcr.io/zeroclaw-labs/zeroclaw:v0.7.1). ✅
This level of follow-through on advisory notes is not required — it is appreciated.
Milestone — v0.7.2 (RESOLVED) ✅
Still open
🔴 PR template sections are missing
The current PR body has a Summary section and a Validation Evidence block, but the
required template sections are still absent. For a deploy-k8s/ docs PR most answers
are short — but the template exists so reviewers can confirm each dimension was
considered, and so the audit trail is consistent across all PRs per FND-003.
The sections still needed (from .github/pull_request_template.md):
- Label Snapshot —
risk: low,size: S, scopedocs/deploy - Change Metadata — change type (
featureordocs), primary scope - Security Impact — all four Yes/No fields (all
Nofor this PR) - Privacy and Data Hygiene — data-hygiene status and neutral wording confirmation
- Compatibility / Migration — backward compatible:
Yes, no config/migration changes - i18n Follow-Through —
No— no user-facing strings changed - Human Verification — the cluster test evidence you already have is exactly what
belongs here; move it from Validation Evidence or duplicate it - Side Effects / Blast Radius — affected subsystems:
deploy-k8s/subtree only,
no runtime changes - Rollback Plan —
git revert <sha>removes the subtree, no side effects
The answers for this PR are nearly all one-liners or "No / N/A". The substance is
already in your description — it is a matter of placing it into the correct sections.
🟡 Labels risk: low and size: S are not set
The label check shows only ["docs"]. Both risk: low and size: S are required
before merge. If you do not have label-setting access as a contributor, leave a comment
here and a maintainer (@theonlyhennygod or @singlerider) can apply them — this is not
a blocker on your end if access is the issue, just flag it.
🟡 Validation evidence is stale
The Validation Evidence block in the current PR body shows a local run with a
cargo clippy error in crates/zeroclaw-config/src/policy.rs:552. That error is a
pre-existing upstream issue unrelated to this PR (which touches no Rust files), and CI
is showing ✓ passing on the current commit. The body should either:
- Replace the evidence block with a note pointing to the passing CI run on
commit5ebc254, or - Keep the local run output but add a sentence clarifying that the clippy error is a
pre-existing upstream issue not introduced by this PR, and that CI on this branch
passes.
Either is fine. The current state is just confusing — a reader sees a compile error
in the evidence block and then a green checkmark on the PR, and has to reconcile them
without explanation.
Code is correct — no new findings on the diff
To be explicit: the manifests themselves are in good shape. The security context is
complete, secret injection is correct, probe differentiation is intentional, the
.gitignore pattern is clean, and the config defaults now align with FND-006. There are
no new code-level findings in this re-review.
To merge
Two things left:
- Fill in the missing PR template sections (most are one-liners or "No").
- Set
risk: lowandsize: Slabels — or ask a maintainer to set them if you lack
access. - Update or annotate the validation evidence block to explain the pre-existing clippy
noise.
Once those are done this is ready to merge.
|
@pavelanni can you also take a look at and comment on #5890 when you have a chance while I do my PR review rounds? |
WareWolf-MoonWall
left a comment
There was a problem hiding this comment.
PR Re-Review — #5880 feat(deploy): add OpenShift/K8s deployment manifests
Reviewing on commit 5ebc254. All three items from my prior CHANGES_REQUESTED have been
addressed. The manifests themselves are unchanged and remain correct. This is ready to merge.
Prior findings — resolved ✅
🔴 PR template sections missing (RESOLVED ✅)
All required sections are now present: Security & Privacy Impact (all four Yes/No fields
answered), Compatibility, Label Snapshot, Change Metadata, Privacy & Data Hygiene,
i18n Follow-Through, Side Effects / Blast Radius, and Rollback Plan. The answers are
exactly right for this PR type — short, precise, and honest about scope. The section
content matches what was already in the description; it just needed to be placed in the
correct structure for the audit trail. ✅
🟡 Labels risk: low and size: S not set (ADDRESSED ✅)
The Label Snapshot section explicitly notes that risk: low and size: S are the correct
labels, states the reason label permissions aren't available, and tags @theonlyhennygod
and @singlerider by name to apply them. This is exactly the right move for a contributor
without label access — the intent is on the record, the right people are named, and the
work is not blocked. The label application itself is a maintainer action, not a contributor
blocker. ✅
🟡 Validation evidence stale — pre-existing clippy error unexplained (RESOLVED ✅)
The NOTE at the top of the Validation Evidence section is clear and accurate: the clippy
error is pre-existing in zeroclaw-config/src/policy.rs:552, unrelated to this PR (which
touches no Rust files), and CI on 5ebc254 is passing green. A reader no longer has to
reconcile a compile error in the evidence block with a green CI checkmark without
explanation. ✅
No new findings on the diff
The diff is unchanged from the prior re-review. The security context, secret injection,
probe differentiation, .gitignore pattern, and config defaults are all still correct.
The three advisory notes from the first review (ephemeral state caveat, vanilla K8s path,
:latest pin comment) are still present in the README and deployment manifest. Nothing
new to raise.
🟢 Nitpick (entirely optional — do not hold the PR for this)
The Privacy & Data Hygiene section heading has a typo: "Privace" instead of "Privacy".
One character fix if you want to clean it up on a future edit, but this is purely cosmetic
and does not affect anything.
To merge
Two things remain — neither is on pavelanni's plate:
- Labels: A maintainer (@theonlyhennygod or @singlerider) needs to apply
risk: low
andsize: Svia the sidebar. - Review sign-off: JordanTheJet and theonlyhennygod are listed as requested reviewers
and have not yet weighed in. Once one of them reviews (or a maintainer clears the
request), this is merge-ready.
The contribution is complete. The code is correct, the security posture is solid, the
cluster validation evidence is real, and the template is now fully filled. Nice work
closing the loop on all three items cleanly.
singlerider
left a comment
There was a problem hiding this comment.
Reviewed. Purely additive deploy-k8s/ directory, no source changes. SecurityContext is properly hardened. emptyDir ephemerality is documented. No issues.
Summary
master(all contributions)runAsNonRoot, drop all capabilities, read-only root filesystem, seccompRuntimeDefaultSecurity & Privacy Impact (required)
Yes/No for each. Answer any
Yeswith a 1–2 sentence explanation.Yes, describe the risk and mitigation:Compatibility (required)
Label Snapshot
risk: lowandsize: Sbut I don't have permissions to change them. Please, @theonlyhennygod and @singlerider update the labels.Change Metadata
Privace and Data Hygiene
i18n Follow-Through
Side Effects / Blast Radius
Rollback Plan
— git revert removes the subtree, no side effects
Validation Evidence (required)
NOTE: the clippy error is a pre-existing upstream issue not introduced by this PR, and that CI on this branch passes.
Test plan
kubectl apply --dry-run=clientvalidationrequire_pairing = true) verified working via Route🤖 Generated with Claude Code