sec: Don't block for update on OIDC service start #28608
On start, the OIDC service attempts to fetch the certificates from the IdP. If the service cannot reach the IdP, the attempt times out after 5 seconds. If this happens during cluster startup, it prevents leadership election from occurring, which is especially troublesome for the controller. This has led to a cascade of other failures (such as being unable to assign the cluster ID in the metrics report service). This change starts the update in the background so that the rest of startup can continue without being blocked.
Signed-off-by: Michael Boquard <michael@redpanda.com>
Pull Request Overview
This PR addresses a blocking issue during OIDC service startup that could prevent controller leadership election. Previously, the OIDC service would synchronously wait for certificate updates from the IdP during startup, causing a 5-second timeout if the IdP was unreachable. This blocking behavior during cluster startup prevented critical operations like leadership election and cluster ID assignment.
Key Changes:
- Changed OIDC service startup to spawn the update operation asynchronously instead of blocking on it
- Removed the gate holder acquisition in favor of ssx::spawn_with_gate to manage the background task lifecycle
    - auto holder = _gate.hold();
    - co_await update();
    + ssx::spawn_with_gate(_gate, [this] { return update(); });
    + return ss::now();
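The difference between the two start strategies can be illustrated with a minimal sketch in standard C++. This is not Seastar code and not the code from this PR: `background_starter`, `start_blocking`, `start_background`, and `start_returns_before_update_finishes` are hypothetical names, `std::async` stands in for ssx::spawn_with_gate, and the stored futures only loosely play the role of the gate (letting `stop()` wait for outstanding work).

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <future>
#include <utility>
#include <vector>

// Hypothetical stand-in for a service whose start() used to wait on a slow
// update. Plain standard C++ threads, not Seastar futures.
class background_starter {
public:
    explicit background_starter(std::function<void()> update)
      : _update(std::move(update)) {}

    // Old behavior: start does not return until update() has finished.
    void start_blocking() { _update(); }

    // New behavior: run update() on another thread and return immediately.
    // The future is kept so that stop() can wait for the work to finish.
    void start_background() {
        _pending.push_back(std::async(std::launch::async, _update));
    }

    // Wait for all background work before the service is torn down.
    void stop() {
        for (auto& f : _pending) {
            f.wait();
        }
        _pending.clear();
    }

private:
    std::function<void()> _update;
    std::vector<std::future<void>> _pending;
};

// Returns true if start_background() returned while update() was still
// pending, and update() had completed by the time stop() returned.
bool start_returns_before_update_finishes() {
    std::promise<void> release;
    std::shared_future<void> released = release.get_future().share();
    std::atomic<bool> done{false};

    // The update cannot finish until the promise is fulfilled below, which
    // models an IdP that has not answered yet.
    background_starter svc([&] {
        released.wait();
        done = true;
    });

    svc.start_background();
    bool returned_early = !done.load();

    release.set_value(); // the "IdP" finally responds
    svc.stop();          // drain background work, as a gate close would
    return returned_early && done.load();
}
```

Calling `start_blocking()` with the same update would never return in this setup, because the update waits on a value that is only released after start returns. That is the same shape of problem the PR describes, where startup waits on an IdP that does not answer.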
Is it just the 5s delay that causes the failed startup? I.e., does that lead to a timeout elsewhere?
There's a timeout in config_manager awaiting controller leadership. This timeout is preventing controller leadership election from forming.
/backport v25.3.x
/backport v25.2.x
/backport v25.1.x
Backports Required
Release Notes
Bug Fixes