Description
Problem
Health probe settings (timeoutSeconds, failureThreshold, etc.) are hardcoded in templates and cannot be overridden via values.yaml. This causes issues in environments with higher network latency, where the default 1s timeout and low failure thresholds trigger unnecessary pod restarts and 502 errors.
For example, harbor-core's liveness/readiness probes have failureThreshold: 2 with no configurable timeoutSeconds (defaults to 1s). In high-latency environments, transient latency spikes cause probe failures, pod restarts (~every 2 days), and endpoint churn leading to intermittent 502 Bad Gateway errors.
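To make the impact concrete, the approximate worst-case window from the first failed probe to a container restart is `periodSeconds * failureThreshold`. A quick sketch (numbers taken from the defaults and the override proposed below; the formula ignores the in-flight probe's own timeout for simplicity):

```python
def time_to_restart(period_seconds: int, failure_threshold: int) -> int:
    """Approximate seconds of sustained probe failure before kubelet restarts the container."""
    return period_seconds * failure_threshold

# Current harbor-core defaults: a ~20s latency spike can trigger a restart.
print(time_to_restart(10, 2))  # -> 20
# With timeoutSeconds=5 / failureThreshold=5, tolerance grows to ~50s.
print(time_to_restart(10, 5))  # -> 50
```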
Current state
Only two components have any probe configurability in values.yaml:
- `core.startupProbe.enabled` and `core.startupProbe.initialDelaySeconds`
- `database.internal.livenessProbe.timeoutSeconds` and `database.internal.readinessProbe.timeoutSeconds`
All other probe parameters across all 9 components (core, portal, jobservice, registry, nginx, exporter, trivy, database, redis) are hardcoded in templates.
Proposal
Make probe timing parameters (initialDelaySeconds, periodSeconds, timeoutSeconds, failureThreshold, successThreshold) configurable via values.yaml for all components, following the existing pattern used by database.internal.livenessProbe.timeoutSeconds.
Default values would match the current hardcoded values exactly, so there is zero behavioral change on upgrade.
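A rough sketch of what the template change could look like for harbor-core's liveness probe (the `httpGet` block is illustrative only; the actual endpoint in the chart stays as-is, and only the timing fields become templated):

```yaml
# templates/core/core-dp.yaml (sketch)
livenessProbe:
  httpGet:
    path: /api/v2.0/ping   # illustrative; unchanged by this proposal
    port: 8080
  initialDelaySeconds: {{ .Values.core.livenessProbe.initialDelaySeconds }}
  periodSeconds: {{ .Values.core.livenessProbe.periodSeconds }}
  timeoutSeconds: {{ .Values.core.livenessProbe.timeoutSeconds }}
  failureThreshold: {{ .Values.core.livenessProbe.failureThreshold }}
  successThreshold: {{ .Values.core.livenessProbe.successThreshold }}
```

Note that Kubernetes requires `successThreshold: 1` for liveness and startup probes, so the documented default of 1 should be kept for those.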
Example values.yaml structure
```yaml
core:
  startupProbe:
    enabled: true
    initialDelaySeconds: 10
    periodSeconds: 10
    timeoutSeconds: 1
    failureThreshold: 360
    successThreshold: 1
  livenessProbe:
    initialDelaySeconds: 0
    periodSeconds: 10
    timeoutSeconds: 1
    failureThreshold: 2
    successThreshold: 1
  readinessProbe:
    initialDelaySeconds: 0
    periodSeconds: 10
    timeoutSeconds: 1
    failureThreshold: 2
    successThreshold: 1
```
Example usage for high-latency environments
```yaml
# values-override.yaml
core:
  livenessProbe:
    timeoutSeconds: 5
    failureThreshold: 5
  readinessProbe:
    timeoutSeconds: 5
    failureThreshold: 5
  startupProbe:
    timeoutSeconds: 5
    failureThreshold: 5
```
Components affected
All 9 components with probes: core, portal, jobservice, registry (registry + registryctl containers), nginx, exporter, trivy, database, redis.
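For clarity, this is the probe spec that would render for harbor-core with the override above applied (a sketch; the `httpGet` endpoint is assumed for illustration). Overridden fields replace the defaults, everything else keeps today's hardcoded values:

```yaml
livenessProbe:
  httpGet:
    path: /api/v2.0/ping   # assumed endpoint; not changed by this proposal
    port: 8080
  initialDelaySeconds: 0
  periodSeconds: 10
  timeoutSeconds: 5        # overridden (default 1)
  failureThreshold: 5      # overridden (default 2)
  successThreshold: 1
```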
I'm happy to open a PR for this if the approach looks good.