
Conversation

Member

@hongkailiu hongkailiu commented Jun 10, 2025

The proposed API extension is in openshift/api#2360

@hongkailiu hongkailiu changed the title [OTA-1544]update: support accepted risks [wip][OTA-1544]update: support accepted risks Jun 10, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 10, 2025
@hongkailiu hongkailiu force-pushed the OTA-1544 branch 2 times, most recently from 247c6c4 to 5778409 on June 10, 2025 at 23:46
@hongkailiu hongkailiu force-pushed the OTA-1544 branch 2 times, most recently from 5c2ed41 to 77497e7 on June 17, 2025 at 20:01
wking added a commit to wking/oc that referenced this pull request Jun 18, 2025
…isk"

Make it clearer how '--accept ConditionalUpdateRisk' maps to a risk
like NonZonalAzureMachineSetScaling getting accepted, by turning the
previous:

  Reason: accepted NonZonalAzureMachineSetScaling

into:

  Reason: accepted NonZonalAzureMachineSetScaling via ConditionalUpdateRisk

Eventually we'll have an API that allows us to use the
conditional-update risk name itself (e.g. '--accept
NonZonalAzureMachineSetScaling') [1,2], but this 'via...' context will
hopefully help avoid confusion in the meantime.

[1]: openshift/enhancements#1807
[2]: https://issues.redhat.com/browse/OTA-1543
@hongkailiu hongkailiu force-pushed the OTA-1544 branch 3 times, most recently from dd72591 to 191fcab on July 3, 2025 at 00:59

Note that omitting `--accept` in the above command means accepting no risks at all, and `cv.spec.desiredUpdate.accept` is going to be set to the empty set. A cluster admin who chooses to do GitOps on the ClusterVersion manifest should not use `oc adm upgrade` to perform a cluster update.

The cluster-version operator finds that the update to `4.18.16` is not recommended because of the risks `DualStackNeedsController`, `OldBootImagesPodmanMissingAuthFlag`, and `RHELKernelHighLoadIOWait`, and only the first two of them are accepted by the administrator. Thus, the cluster update to `4.18.16` is blocked. After a couple of weeks, `4.18.17` is released, which contains fixes for `DualStackNeedsController` and `RHELKernelHighLoadIOWait`. The only remaining risk of `4.18.17` is `OldBootImagesPodmanMissingAuthFlag`. When the cluster is updated to `4.18.17`, e.g., by the following command:
Member

Thus, the cluster update to 4.18.16 is blocked.

How would this "blocked" be reported to the user? I am asking because the current check is client-side AFAIK, and is a part of oc adm upgrade client code. If there is a risk on the selected version, oc adm upgrade will not let you proceed without supplying the --allow-not-recommended option, and will not update the .spec.desiredUpdate. But if you e.g. gitops or oc edit the ClusterVersion manually, providing the .spec.desiredUpdate, there is nothing in the CVO that would stop the update (CVO relies on the client-side oc adm upgrade check).

This indicates the check moves to the controller; how would it report the "blocked" state?

And also, would we port the oc adm upgrade client-side check to this new mechanism? --allow-not-recommended is basically equivalent to "accept all risks on this path". Do I understand correctly that after the change, a bare oc adm upgrade --to=1.2.3 would no longer be refused through the client-side check; it would edit the .spec.desiredUpdate, but the list of accepted risks would be empty, so CVO would not start the update until the risks evaluate away or the admin accepts them post-hoc?

Contributor

a bare oc adm upgrade --to=1.2.3 would no longer be refused through the client-side check

If this went through, and therefore explicitly cleared accept, then how does adding the explicit names of the accepted risks become re-usable over the course of several upgrades?

Or does this only get accepted if the accept list already contains all of the related risks?

Member Author

@hongkailiu hongkailiu Jul 17, 2025

In my mind, we will add the guard to CVO and do not have to change the existing client-side guard.

Assume a cluster admin wants to do a conditional update to 1.2.3.

Today, the cluster admin will run oc adm upgrade --to 1.2.3 --allow-not-recommended (as Petr pointed out, --allow-not-recommended is needed because the update to 1.2.3 is conditional).

After the enhancement, the cluster admin will still run the above command to trigger the update. Thus, the .spec.desiredUpdate is modified accordingly.

Then CVO sees the new .spec.desiredUpdate and checks if it is a conditional update:

  • If yes, then check whether its risks are all included in spec.desiredUpdate.accept.
    • If yes, accept and proceed with the update.
    • If no, block the update. We will set ReleaseAccepted=False, as Petr already realized here.
  • If no, proceed with the update.

Joel, I hope the above addresses your questions too. Let me know otherwise.
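The subset check in the first bullet above can be sketched as follows. This is a hypothetical illustration (the function and variable names are mine), not actual cluster-version-operator code:

```go
package main

import "fmt"

// allRisksAccepted reports whether every risk attached to a conditional
// update is present in the admin's accepted set, and returns any risks
// that are not yet accepted. Sketch of the proposed CVO-side guard.
func allRisksAccepted(updateRisks, accepted []string) (bool, []string) {
	acceptedSet := make(map[string]bool, len(accepted))
	for _, name := range accepted {
		acceptedSet[name] = true
	}
	var missing []string
	for _, risk := range updateRisks {
		if !acceptedSet[risk] {
			missing = append(missing, risk)
		}
	}
	return len(missing) == 0, missing
}

func main() {
	risks := []string{"DualStackNeedsController", "OldBootImagesPodmanMissingAuthFlag"}
	accepted := []string{"DualStackNeedsController"}
	ok, missing := allRisksAccepted(risks, accepted)
	fmt.Println(ok, missing) // false [OldBootImagesPodmanMissingAuthFlag]
}
```

With this shape, the "blocked" branch has the missing risk names at hand for the ReleaseAccepted=False condition message.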

Member Author

If this went through, and therefore explicitly cleared accept, then how does adding the explicit names of the accepted risks become re-usable over the course of several upgrades?

After re-reading your comment above again today, I think I understand the question better now.

Since spec.desiredUpdate.accept is going to be cleared if --accept is not used in the command, that means the admin decides NOT to reuse the accepted risks, i.e., the risks are no longer accepted by the admin. Otherwise, --accept would have been provided.

Moreover, a more common use case is to reuse the risk evaluation across clusters. Once an admin upgrades a cluster with oc adm upgrade --to 1.2.3 --accept RiskA,RiskB and is happy with the result, the same command could be applied to other (similar) clusters.

Contributor

Hmm, thinking about this, we recently talked about this in the API PR didn't we? The behaviour you've described isn't aligning with my expectations from that discussion

In the API, we've said it's up to the end user to clear the list. Do we still expect this behaviour in the CLI?

I'm wondering if it makes sense for the accepted risks to just be persisted and have an easier way to iterate over multiple upgrades with the same risk names? Or does that not happen?

Member Author

@hongkailiu hongkailiu Aug 6, 2025

The re-use is mainly about multiple clusters. That is also the main use case for the feature provided in this enhancement.

For an admin who manages only one cluster: it still depends on the user to clear the list.
In this workflow, that is done by NOT providing --accept in the command (which clears the list).
This command is just a shortcut for oc edit or oc apply.
Are you suggesting that oc adm upgrade should not prune any field in spec at all?

iterate over multiple upgrades with the same risk names? Or does that not happen?

It could happen (although only occasionally): 1.1.1 (RiskA) -> 1.1.2 (RiskA) -> 1.1.3
Arguably, we may avoid claiming RiskA from the 2nd edge above but we do not exclude it today.

Are you suggesting that oc adm upgrade should do "append" only?

Contributor

Are you suggesting that oc adm upgrade should not prune any field in spec at all?

I think maybe we want an explicit way to clear the list, and that the standard behaviour would be to just append anything new from the CLI to the existing list.

Otherwise it creates an odd discrepancy between those using the oc adm upgrade CLI utility, and those using other means to edit the ClusterVersion resource
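The append-by-default behaviour proposed here could look roughly like this. A minimal sketch with hypothetical names, assuming an explicit flag (not the absence of --accept) is what clears the list:

```go
package main

import "fmt"

// mergeAcceptedRisks sketches the proposed behaviour: names passed on the
// CLI are appended to the existing accepted list (deduplicated, order
// preserved), and clearing is a separate, explicit action rather than a
// side effect of omitting --accept. Hypothetical helper, not actual oc code.
func mergeAcceptedRisks(existing, added []string, reset bool) []string {
	if reset {
		existing = nil
	}
	seen := make(map[string]bool, len(existing)+len(added))
	merged := make([]string, 0, len(existing)+len(added))
	for _, name := range append(existing, added...) {
		if !seen[name] {
			seen[name] = true
			merged = append(merged, name)
		}
	}
	return merged
}

func main() {
	// A second update reuses RiskA accepted earlier and adds RiskB.
	accepted := mergeAcceptedRisks([]string{"RiskA"}, []string{"RiskA", "RiskB"}, false)
	fmt.Println(accepted) // [RiskA RiskB]

	// An explicit clear starts over from an empty list.
	fmt.Println(mergeAcceptedRisks(accepted, nil, true)) // []
}
```

This keeps oc adm upgrade consistent with oc edit and GitOps clients, which never implicitly prune the list.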

Member Author

Got it.

I think maybe we want an explicit way to clear the list

This is more evidence that you prefer Workflow 2. I have counted it in #1807 (comment)

I will consider this one resolved.

Comment on lines 128 to 144
#### Comparing Two Workflows:

Benefits of `oc adm upgrade --to 4.y.z --accept RiskA,RiskB`:

* No need for a new subcommand, so manipulating `cv.spec.desiredUpdate` stays consolidated.
* Encourages cluster-admins to build their own systems for managing acceptance information (e.g. GitOps), because while the new `cv.spec.desiredUpdate.accept` allows admins to store the risks they accept, it doesn't have space for them to explain why they find those risks acceptable, and that seems like important information that a team of admins sharing risk-acceptance decision-making would want to have available.

Benefits of `oc adm upgrade accept [--append] RiskA,RiskB`:

* Convenient appending, so an admin can roll forward with the things previous admins have already accepted, without having to worry about what those were.
* Convenient way to share accepted risk names (but not your reasoning for acceptance) via ClusterVersion, without having to build your own system to share those between multiple cluster-admins.
Member

To me an interesting difference is the context and its implications.

If I run oc adm upgrade --to=version --accept riska,riskb, then I naively expect that acceptance to be tied to that single update: "update to that version and I don't care about that risk now". I probably would not expect that acceptance to persist into the future. Imagine the scenario where there would be a risk on the path to the next minor, and the admin wants to update to the next patch (without a risk). Maybe the admin is clueless or uses a bash history or whatever, so they use a --to=1.2.patch --accept RiskX to be sure. The risk is irrelevant for that specific upgrade, but it persists into the future, somewhat surprisingly (at least to me).

The standalone command IMO communicates the persistence a bit better through how it is used.

Member Author

@hongkailiu hongkailiu Jul 17, 2025

I naively expect that acceptance to be tied to that single update ... I probably would not expect that acceptance to persist into the future.

It makes sense to me too.

However, it should not be a big deal anyway. My assumption here is that the acceptance of a risk is not version-specific: you accept RiskA today, you accept it forever. If you change your mind for any reason, oc edit to remove it from the accepted risks.

But sure, I will count two votes (Petr and Joel) for Workflow 2.

risks: # deprecated by riskNames
```

When a conditional update is accepted, the names of its associated risks are going to be merged into `clusterversion.status.history.acceptedRisks`, an existing field that predates this enhancement. For example, CVO's acceptance of `4.18.17` leads to `OldBootImagesPodmanMissingAuthFlag` becoming part of the value of `clusterversion.status.history.acceptedRisks`. The wording might look like "The target release ... is exposed to the risks [OldBootImagesPodmanMissingAuthFlag] and accepted by CVO because all of them are considered acceptable".
Member

@petr-muller petr-muller Jul 4, 2025

I wonder if we want to introduce a new, structured field (a list of accepted risks, or items, see below) instead of trying to pack everything into a string.

I am thinking whether this all leads to some abstract notion of "things that an admin accepts when updating". We now have accepted risks, --allow-explicit-upgrade, --allow-with-warnings, --force... All of these are a kind of acceptance; maybe it warrants a structured list of items that were accepted?

Member Author

Let me think about this and get back to it.

Member Author

Just to make sure I understand your suggestion correctly:
you want a struct instead of the current string for status.history.acceptedRisks, right?

$ oc explain clusterversion.status.history.acceptedRisks
GROUP:      config.openshift.io
KIND:       ClusterVersion
VERSION:    v1

FIELD: acceptedRisks <string>


DESCRIPTION:
    acceptedRisks records risks which were accepted to initiate the update.
    For example, it may menition an Upgradeable=False or missing signature
    that was overriden via desiredUpdate.force, or an update that was
    initiated despite not being in the availableUpdates set of recommended
    update targets.

Member

You want a struct instead of the current string

More like a list of items, either strings or maybe structs:

acceptedByAdmin:
- type: known-issue
  riskName: AWSBug
- type: warnings
  warnings:
  - warning-one
  - warning-two

It's just a thought. This is a stable API so we cannot change status.history.acceptedRisks but we may start maintaining a new structured one. Probably beyond this enhancement though.


When a conditional update is accepted, the names of its associated risks are going to be merged into `clusterversion.status.history.acceptedRisks`, an existing field that predates this enhancement. For example, CVO's acceptance of `4.18.17` leads to `OldBootImagesPodmanMissingAuthFlag` becoming part of the value of `clusterversion.status.history.acceptedRisks`. The wording might look like "The target release ... is exposed to the risks [OldBootImagesPodmanMissingAuthFlag] and accepted by CVO because all of them are considered acceptable".

When a conditional update is blocked, there is a condition in `clusterversion.status.conditions` with `ReleaseAccepted=False`, e.g.,
Member

OK, so this answers one of my earlier comments; blocking through `ReleaseAccepted=False` sounds reasonable.

Contributor

Are we losing the immediate feedback that we have today? Why is it desirable for an admin to trigger an update and then have to check the cluster object to see if it is actually going to be actioned, rather than today, where they know if the upgrade command succeeds, the cluster will start moving?

It seems we are taking a backwards step in terms of UX here no?

Member

The problem with client-based checks is that you don't get their benefit if you oc edit or gitops.

I wonder if we could change oc adm upgrade to change .spec and then wait for CVO to update the .status conditions and emit messages based on that. That way the server-side checks would propagate to CLI behavior.

Contributor

How does oc adm upgrade get the information to make the decision? Does it source it from the CVO conditional updates status that exists today?

Member

@petr-muller petr-muller Jul 11, 2025

Yeah, a version is either in status.conditionalUpdates or status.availableUpdates, depending on whether the cluster is exposed to a risk on that edge (the former) or not (the latter).

These status fields are populated by the CVO (it gets the data from OSUS and, if a risk is present, feeds the condition's PromQL through the cluster's monitoring stack, which results in placing the edge in one of the two lists). They are populated asynchronously, not triggered by oc adm upgrade.
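The split described above can be sketched as a simple classification. The `Edge` type and function names here are hypothetical stand-ins, not CVO code:

```go
package main

import "fmt"

// Edge is a hypothetical, simplified stand-in for an update edge as the
// CVO might see it after evaluating OSUS data and risk PromQL results.
type Edge struct {
	Version string
	Risks   []string // risks the cluster is exposed to on this edge
}

// classify splits edges the way described above: edges with at least one
// matching risk land in status.conditionalUpdates, risk-free edges land
// in status.availableUpdates. Sketch only, not cluster-version-operator code.
func classify(edges []Edge) (conditional, available []Edge) {
	for _, e := range edges {
		if len(e.Risks) > 0 {
			conditional = append(conditional, e)
		} else {
			available = append(available, e)
		}
	}
	return conditional, available
}

func main() {
	edges := []Edge{
		{Version: "4.18.16", Risks: []string{"DualStackNeedsController"}},
		{Version: "4.18.17"},
	}
	cond, avail := classify(edges)
	fmt.Println(len(cond), len(avail)) // 1 1
}
```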

Contributor

I'm wondering, if a request to update the desired version included a list of risks with it, whether we could implement, either through CEL on the CRD itself or through a VAP, a way to check the same thing oc adm upgrade is doing, but at admission time. That would allow oc edit or GitOps clients to get the same sort of functionality.

Something to investigate perhaps?

Member Author

I wonder if we could change oc adm upgrade to change .spec and then wait for CVO to update the .status conditions and emit messages based on that.

The "wait for" part sounds like a big UX leap.

CVO has server-side guards already. The oc CLI does not wait for any of them, correct?

Member Author

@hongkailiu hongkailiu Jul 17, 2025

whether we could implement, either through CEL on the CRD itself, or through a VAP, a way for it to check the same thing oc adm upgrade is doing, but through admission time.

Joel, is VAP short for validating admission policy?

(I am new to both CEL and VAP.) I feel this suggestion already applies to the implementation of the oc CLI today.
In that case, we do not have to block the enhancement on providing that solution; we may move to it afterwards. Right?

Contributor

Yes, apologies, I meant validating admission policy.

And yes, this was just a comment about a general UX optimisation; it just seemed relevant to this thread as an alternative.

- Announce deprecation and support policy of the existing feature
- Deprecate the feature

## Upgrade / Downgrade Strategy
Member

Because this enhancement proposes deprecating existing API fields, I would expect this section to discuss how exactly the transition happens. Because we are changing an L1 API with a compatibility guarantee, maybe we would need to maintain the old fields for some time to allow potential clients (ClusterVersion is an API and we have no control over all its consumers) to migrate to the new shape?

Member Author

@hongkailiu hongkailiu Jul 18, 2025

Interesting.

The deprecation message has been removed from the API because the new API is gated by a feature starting in TechPreview and there is currently no way to include a part of the GoDoc only for a feature. openshift/api#2360 (comment)

Assume that we want to promote/GA this feature. We could add back the deprecation message. I guess we cannot just stop populating the L1 API until OpenShift 5. So my feeling is that, in theory, the deprecation message is ONLY there to encourage users to use the new field (and then be ready for OpenShift 5).

maybe we would need to maintain the old fields for some time to allow potential clients (ClusterVersion is an API and we have no control over all its consumers) to migrate to the new shape?

I am wondering if this sort of migration is an option for us at all?

Every reviewer asked about this. Maybe this is the time for me to understand it better and include it in the enhancement.

Contributor

there is no way currently to include a part of GoDoc to a feature

I don't think this would be very hard at all to add to our toolset if you think it would be useful

Contributor

I am wondering if this sort of migration is an option for us at all?

Officially you should maintain it until OpenShift 5 and set both values

Member Author

Added a sentence to make the meaning of deprecation clear in the new patch.

Member

Sounds good to me

wking added a commit to wking/oc that referenced this pull request Jul 7, 2025
Contributor

@JoelSpeed JoelSpeed left a comment

Reviewing the comments from @petr-muller, it seems there is a significant UX question about this enhancement. I agree with pretty much all of the UX issues raised by @petr-muller, and I think we need to have some further discussion here before we commit to a particular API change.


#### Workflow 1

Provide the existing `oc adm upgrade` command with a new, optional `--accept` option to update a cluster, i.e., `--accept` takes effect only when performing a cluster update.
Contributor

Any reason for accept vs accept-risk to be more explicit that we are talking about risks?

Member Author

openshift/api#2360 (comment)

Also saw your comment there, "see this being different between the API and the CLI", and the thumbs-up there.

And here it is CLI again. :)

I get the feeling that you prefer "accept-risks" for both the API and the CLI. Am I correct?



Comment on lines +242 to +253
See https://github.com/openshift/enhancements/blob/e044f84e9b2bafa600e6c24e35d226463c2308a5/enhancements/multi-arch/heterogeneous-architecture-clusters.md?plain=1#L282

How does it affect any of the components running in the
management cluster? How does it affect any components running split
between the management cluster and guest cluster?
Contributor

What are the HyperShift considerations? Can we fill out this section?

Member Author

@hongkailiu hongkailiu Jul 18, 2025

The current epic https://issues.redhat.com/browse/OCPSTRAT-2118 focuses on standalone OCP clusters.

I know that we will need additional work on HCP for the API extension, like openshift/hypershift#1954.

I honestly do not know more than that.

@hongkailiu
Member Author

I am still in the process of working on the next patch to address the comments from reviews.
My replies so far should unblock some of the discussion on UX for this enhancement.
I will continue addressing the remaining ones tomorrow.

Contributor

openshift-ci bot commented Aug 6, 2025

@hongkailiu: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/markdownlint
Commit: bad0c74
Required: true
Rerun command: /test markdownlint

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Member

@petr-muller petr-muller left a comment

LGTM

Contributor

openshift-ci bot commented Aug 15, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: petr-muller
Once this PR has been reviewed and has the lgtm label, please assign pratikmahajan for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Labels
do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.
4 participants