sec: Don't block for update on OIDC service start #28608

Merged
michael-redpanda merged 1 commit into redpanda-data:dev from michael-redpanda:ci/core-14542 on Nov 18, 2025
@michael-redpanda

Contributor

When the OIDC service starts, it attempts to fetch the certificates from the IdP. If the service cannot reach the IdP, the fetch times out after 5 seconds. If this happens during cluster startup, it prevents leadership election from occurring, which is especially troublesome for the controller. This has led to a cascade of other failures (such as being unable to assign the cluster ID in the metrics report service).

This change starts the update in the background to permit the rest of startup to continue without being blocked.
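The effect on startup time can be illustrated with a small model. This is only a sketch in plain C++, not Redpanda code: `step`, `timeline`, and `simulate` are hypothetical names, and the 100 ms figure used for the rest of startup in the note below is an assumed value chosen for illustration. Only the 5 second timeout comes from the description above.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <vector>

using ms = std::chrono::milliseconds;

// One startup step (hypothetical model, not a Redpanda type).
// background == false: startup awaits the step before moving on.
// background == true:  the step is spawned and startup continues immediately.
struct step {
    ms duration;
    bool background;
};

struct timeline {
    ms startup_done; // when the last awaited step finished
    ms all_done;     // when every step, including background ones, finished
};

// Walks the steps in order on a simulated clock.
timeline simulate(const std::vector<step>& steps) {
    ms now{0};
    ms latest_background{0};
    for (const auto& s : steps) {
        assert(s.duration >= ms{0});
        if (s.background) {
            // Runs concurrently: it finishes later but does not advance `now`.
            latest_background = std::max(latest_background, now + s.duration);
        } else {
            // Awaited: everything after it is delayed by its full duration.
            now += s.duration;
        }
    }
    return {now, std::max(now, latest_background)};
}
```

With an unreachable IdP (a 5000 ms step) followed by an assumed 100 ms of remaining startup work, `simulate({{ms{5000}, false}, {ms{100}, false}})` reports startup finishing at 5100 ms, whereas `simulate({{ms{5000}, true}, {ms{100}, false}})` reports startup finishing at 100 ms while the update itself still completes at 5000 ms.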

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x
  • v24.3.x

Release Notes

Bug Fixes

  • Fixes an issue where controller leadership election could be prevented if the OIDC service failed to communicate with the IdP

Upon OIDC service start, it will attempt to get the certificates from
the IdP.  If the service is unable to communicate with the IdP, this
process times out after 5 seconds.  If this occurs during cluster start
up, this will prevent leadership election from occurring, which is
especially troublesome with the controller.  This has led to a cascade
of other failures (such as being unable to assign the cluster ID in the
metrics report service).

This change starts the update in the background to permit the rest of
startup to continue without being blocked.

Signed-off-by: Michael Boquard <michael@redpanda.com>
@michael-redpanda michael-redpanda self-assigned this Nov 18, 2025
Copilot AI review requested due to automatic review settings November 18, 2025 15:33

Copilot AI left a comment


Pull Request Overview

This PR addresses a blocking issue during OIDC service startup that could prevent controller leadership election. Previously, the OIDC service would synchronously wait for certificate updates from the IdP during startup, causing a 5-second timeout if the IdP was unreachable. This blocking behavior during cluster startup prevented critical operations like leadership election and cluster ID assignment.

Key Changes:

  • Changed OIDC service startup to spawn the update operation asynchronously instead of blocking on it
  • Removed the gate holder acquisition in favor of ssx::spawn_with_gate to manage the background task lifecycle

-    auto holder = _gate.hold();
-    co_await update();
+    ssx::spawn_with_gate(_gate, [this] { return update(); });
+    return ss::now();
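For readers unfamiliar with the gate pattern in the hunk above, the following is a minimal single-threaded model of the idea, assuming that `ssx::spawn_with_gate` enters the gate, runs the work in the background, and leaves the gate when the work finishes. It does not use Seastar or Redpanda's `ssx` helpers; `gate_model`, `task_queue`, and `spawn_with_gate_model` are hypothetical names invented for this sketch, and the real helpers deal with futures and shutdown exceptions that are omitted here.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical stand-in for a gate: counts in-flight work and rejects
// new entries once closed.
class gate_model {
public:
    bool try_enter() {
        if (closed_) {
            return false;
        }
        ++holders_;
        return true;
    }
    void leave() {
        assert(holders_ > 0);
        --holders_;
    }
    void close() { closed_ = true; }
    // True once the gate is closed and all in-flight work has left it.
    bool drained() const { return closed_ && holders_ == 0; }
    int holders() const { return holders_; }

private:
    int holders_ = 0;
    bool closed_ = false;
};

// Hypothetical stand-in for a scheduler: deferred tasks run later,
// when run_pending() is called.
class task_queue {
public:
    void defer(std::function<void()> task) { tasks_.push_back(std::move(task)); }
    void run_pending() {
        while (!tasks_.empty()) {
            auto task = std::move(tasks_.front());
            tasks_.pop_front();
            task();
        }
    }

private:
    std::deque<std::function<void()>> tasks_;
};

// Enters the gate now, defers the work, and returns without waiting for it.
// Returns false if the gate is already closed.
bool spawn_with_gate_model(
  gate_model& gate, task_queue& queue, std::function<void()> work) {
    if (!gate.try_enter()) {
        return false;
    }
    queue.defer([&gate, work = std::move(work)] {
        work();
        gate.leave(); // release the gate only after the work has finished
    });
    return true;
}
```

In this model, a `start()` that calls `spawn_with_gate_model` returns before the update has run, while a later `close()` followed by `run_pending()` still lets the in-flight update finish before `drained()` becomes true. That is the property the change relies on: startup is no longer blocked, but shutdown can still wait for the background task.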

Contributor


Is it just the 5s delay that causes the failed startup? I.e., does that lead to a timeout elsewhere?

Contributor Author


There's a timeout in config_manager awaiting controller leadership. This timeout is preventing controller leadership election from forming.

@vbotbuildovich

Collaborator

Retry command for Build#76558

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/log_compaction_test.py::LogCompactionTxRemovalUpgradeTest.test_tx_control_batch_removal_with_upgrade@{"test_case_name":"All commits"}

@vbotbuildovich

Collaborator

CI test results

test results on build#76558
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
|---|---|---|---|---|---|---|---|---|
| AuditLogTestKafkaApi | test_no_auth_enabled | {"audit_transport_mode": "kclient"} | integration | https://buildkite.com/redpanda/redpanda/builds/76558#019a97c1-4fc0-4031-9fe3-d82535bb8aa5 | FLAKY | 19/21 | upstream reliability is '99.21011058451816'. current run reliability is '90.47619047619048'. drift is 8.73392 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestKafkaApi&test_method=test_no_auth_enabled |
| LogCompactionTxRemovalUpgradeTest | test_tx_control_batch_removal_with_upgrade | {"test_case_name": "All commits"} | integration | https://buildkite.com/redpanda/redpanda/builds/76558#019a97bb-cd8e-46bc-907e-9c1b373857c0 | FLAKY | 8/21 | upstream reliability is '88.35616438356165'. current run reliability is '38.095238095238095'. drift is 50.26093 and the allowed drift is set to 50. The test should FAIL | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=LogCompactionTxRemovalUpgradeTest&test_method=test_tx_control_batch_removal_with_upgrade |

@michael-redpanda michael-redpanda merged commit 1fb57c8 into redpanda-data:dev Nov 18, 2025
15 of 18 checks passed
@vbotbuildovich

Collaborator

/backport v25.3.x

@vbotbuildovich

Collaborator

/backport v25.2.x

@vbotbuildovich

Collaborator

/backport v25.1.x

4 participants