Skip to content

ClusterCondition::last_update_time is updated on no-ops, causing infinite reconciles (in the worst case) #1032

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
nightkr opened this issue May 14, 2025 · 1 comment · May be fixed by #1054

Comments

@nightkr
Copy link
Member

nightkr commented May 14, 2025

Affected version

Yes. (Still an issue on trunk, introduced in #571, rolled out around SDP 23.4.)

Current and expected behavior

Reconciling a cluster where there nothing has changed should be a no-op.

ClusterCondition::last_update_time breaks this expectation since it is set unconditionally to whatever the current time is, rounded to the second (

if old_condition.status == new_condition.status {
ClusterCondition {
last_update_time: Some(now),
last_transition_time: old_condition.last_transition_time,
..new_condition
}
). This is registered as another object modification if the new reconcile is not within the same wall-second as the previous one. Depending on how long one reconcile takes, that can cause (up to) an infinite re-reconciliation loop while the object is trying to settle down (which is likely to be an indication that the cluster is struggling to begin with!).

Possible solution

  1. Drop last_update_time completely (for compat: either stub it out or make it equivalent to last_transition_time)
  2. Take the value from whenever the data source for the condition was updated, rather than the current wall time (if it makes sense/is possible for that condition)

Additional context

Discovered by @siegfriedweber, discussed at https://stackable-workspace.slack.com/archives/C02FZ581UCD/p1747230004370629

Environment

No response

Would you like to work on fixing this bug?

None

@maltesander
Copy link
Member

The approach back then was to follow the OpenShift ClusterOperatorStatusCondition, see https://github.com/openshift/api/blob/b1bcdbc3/config/v1/types_cluster_operator.go#L101.

There, the last_updated_time does not even appear so i am not sure why it was introduced here, as this would always only be the last timestamp the operator reconciled, which does not provide much value.

Suggestion is using Solution 1 and just drop it.

@maltesander maltesander moved this from In Refinement to In Progress in Stackable End-to-End Coordination Jun 12, 2025
@maltesander maltesander linked a pull request Jun 12, 2025 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: In Progress
Development

Successfully merging a pull request may close this issue.

3 participants