[wip][OTA-1544] update: support accepted risks #1807

Conversation
…isk" Make it clearer how '--accept ConditionalUpdateRisk' maps to a risk like NonZonalAzureMachineSetScaling getting accepted, by turning the previous: Reason: accepted NonZonalAzureMachineSetScaling into: Reason: accepted NonZonalAzureMachineSetScaling via ConditionalUpdateRisk Eventually we'll have an API that allows us to use the conditional-update risk name itself (e.g. '--accept NonZonalAzureMachineSetScaling') [1,2], but this 'via...' context will hopefully help avoid confusion in the meantime. [1]: openshift/enhancements#1807 [2]: https://issues.redhat.com/browse/OTA-1543
> Note that missing `--accept` in the above command means accepting no risks at all, and `cv.spec.desiredUpdate.accept` is going to be set to the empty set. A cluster admin who chooses to do GitOps on the ClusterVersion manifest should not use `oc adm upgrade` to perform a cluster update.
>
> The cluster-version operator finds that the update to `4.18.16` is not recommended because of the risks `DualStackNeedsController`, `OldBootImagesPodmanMissingAuthFlag`, and `RHELKernelHighLoadIOWait`, and only the first two of them are accepted by the administrator. Thus, the cluster update to `4.18.16` is blocked. After a couple of weeks, `4.18.17` is released, which contains fixes for `DualStackNeedsController` and `RHELKernelHighLoadIOWait`. The only remaining risk of `4.18.17` is `OldBootImagesPodmanMissingAuthFlag`. When the cluster is updated to `4.18.17`, e.g., by the following command:
> Thus, the cluster update to `4.18.16` is blocked.
How would this "blocked" be reported to the user? I am asking because the current check is client-side AFAIK, and is a part of `oc adm upgrade` client code. If there is a risk on the selected version, `oc adm upgrade` will not let you proceed without supplying the `--allow-not-recommended` option, and will not update the `.spec.desiredUpdate`. But if you e.g. GitOps or `oc edit` the ClusterVersion manually, providing the `.spec.desiredUpdate`, there is nothing in the CVO that would stop the update (CVO relies on the client-side `oc adm upgrade` check).
This indicates the check moves to the controller; how would it report the "blocked" state?
And also, would we port the `oc adm upgrade` client-side check to this new mechanism? `--allow-not-recommended` is basically equivalent to "accept all risks on this path". Do I understand correctly that after the change, a bare `oc adm upgrade --to=1.2.3` would no longer be refused through the client-side check; it would edit the `.spec.desiredUpdate`, but the list of accepted risks would be empty, so CVO would not start the update until the risks evaluate away or the admin accepts them post-hoc?
> a bare oc adm upgrade --to=1.2.3 would no longer be refused through the client-side check

If this went through, and therefore explicitly cleared `accept`, then how does adding the explicit names of the accepted risks become re-usable over the course of several upgrades?

Or does this only get accepted if the `accept` list already contains all of the related risks?
In my mind, we will add the guard to CVO and do not have to change the existing client-side guard.

Assume a cluster admin wants to do a conditional update to 1.2.3. Today, the admin runs `oc adm upgrade --to 1.2.3 --allow-not-recommended` (as Petr pointed out, `--allow-not-recommended` is needed because the update to 1.2.3 is conditional).

After the enhancement, the cluster admin will still run the above command to trigger the update. Thus, the `.spec.desiredUpdate` is modified accordingly. Then CVO sees the new `.spec.desiredUpdate` and checks if it is a conditional update:

- If yes, check whether its risks are included in `spec.desiredUpdate.accept`:
  - If yes, accept and proceed with the update.
  - If no, block the update. We will set `ReleaseAccepted=False` (sketched below), as Petr realized already here.
- If no, proceed with the update.
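For concreteness, a minimal sketch of what the blocked case might look like on the cluster. The `accept` field is the shape proposed by this enhancement, the output is trimmed, and the condition `reason`/`message` are illustrative rather than finalized wording:

```console
$ oc get clusterversion version -o yaml
spec:
  desiredUpdate:
    version: 1.2.3
    accept:                  # proposed field; RiskB is missing, so CVO blocks
    - RiskA
status:
  conditions:
  - type: ReleaseAccepted    # existing condition type
    status: "False"
    reason: RisksNotAccepted # illustrative, not finalized
    message: Update to 1.2.3 is blocked because the risk RiskB has not been accepted.
```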
Joel, I hope the above addresses your questions too. Let me know otherwise.
> If this went through, and therefore explicitly cleared accept, then how does adding the explicit names of the accepted risks become re-usable over the course of several upgrades?
After re-reading your comment again today, I think I understand the question better now.

Since `spec.desiredUpdate.accept` is going to be cleared if `--accept` is not used in the command, that means the admin decided NOT to reuse the accepted risks, i.e., the risks are no longer acceptable to the admin. Otherwise, `--accept` would have been provided.

Moreover, a more common use case is to reuse the risk evaluation across clusters. Once an admin updates a cluster with `oc adm upgrade --to 1.2.3 --accept RiskA,RiskB` and is happy with the result, the same command could be applied to other (similar) clusters.
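A minimal sketch of that fleet-level reuse (the `--accept` flag is the one proposed in this enhancement; the cluster contexts are hypothetical):

```console
$ for ctx in cluster-a cluster-b cluster-c; do
>   oc --context "$ctx" adm upgrade --to 1.2.3 --accept RiskA,RiskB
> done
```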
Hmm, thinking about this: we recently talked about this in the API PR, didn't we? The behaviour you've described isn't aligning with my expectations from that discussion.
In the API, we've said it's up to the end user to clear the list. Do we still expect this behaviour in the CLI?
I'm wondering if it makes sense for the accepted risks to just be persisted and have an easier way to iterate over multiple upgrades with the same risk names? Or does that not happen?
The re-use is mainly about multiple clusters. That is also the main use case for the feature provided in this enhancement.

For an admin that manages only one cluster: it still depends on the user to clear the list. In this workflow, that is done by NOT providing `--accept` in the command (to clear the list). This command is just a shortcut to `oc edit` or `oc apply`.

Are you suggesting that `oc adm upgrade` should not prune any field in `spec` at all?

> iterate over multiple upgrades with the same risk names? Or does that not happen?

It could happen (although only occasionally): 1.1.1 (RiskA) -> 1.1.2 (RiskA) -> 1.1.3. Arguably, we may avoid claiming RiskA on the 2nd edge above, but we do not exclude it today.

Are you suggesting that `oc adm upgrade` should do "append" only?
> Are you suggesting that oc adm upgrade should not prune any field in spec at all?
I think maybe we want an explicit way to clear the list, and that the standard behaviour would be to just append anything new from the CLI to the existing list.
Otherwise it creates an odd discrepancy between those using the `oc adm upgrade` CLI utility and those using other means to edit the `ClusterVersion` resource.
Got it.
> I think maybe we want an explicit way to clear the list
This is more evidence that you like Workflow 2. I have counted it in #1807 (comment).
I will consider this one resolved.
> #### Comparing Two Workflows
>
> Benefits of `oc adm upgrade --to 4.y.z --accept RiskA,RiskB`:
>
> * No need for a new subcommand, so manipulating `cv.spec.desiredUpdate` stays consolidated.
> * Encourages cluster-admins to build their own systems for managing acceptance information (e.g. GitOps), because while the new `cv.spec.desiredUpdate.accept` allows admins to store the risks they accept, it doesn't have space for them to explain why they find those risks acceptable, and that seems like important information that a team of admins sharing risk-acceptance decision-making would want to have available.
>
> Benefits of `oc adm upgrade accept [--append] RiskA,RiskB`:
>
> * Convenient appending, so an admin can roll forward with the things previous admins have already accepted, without having to worry about what those were.
> * Convenient way to share accepted risk names (but not your reasoning for acceptance) via ClusterVersion, without having to build your own system to share those between multiple cluster-admins.
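A side-by-side sketch of the two invocations being compared (neither the `--accept` flag nor the `accept` subcommand exists today; both shapes are quoted from the enhancement text above):

```console
# Workflow 1: acceptance is supplied inline with the update request.
$ oc adm upgrade --to 4.18.17 --accept DualStackNeedsController,OldBootImagesPodmanMissingAuthFlag

# Workflow 2: acceptance is recorded by a standalone subcommand, then the update is requested.
$ oc adm upgrade accept --append OldBootImagesPodmanMissingAuthFlag
$ oc adm upgrade --to 4.18.17
```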
To me an interesting difference is the context and its implications.
If I run `oc adm upgrade --to=version --accept riska,riskb`, then I naively expect that acceptance to be tied to that single update: "update to that version and I don't care about that risk now". I probably would not expect that acceptance to persist into the future. Imagine the scenario where there is a risk on the path to the next minor, and the admin wants to update to the next patch (without a risk). Maybe the admin is clueless or uses a bash history or whatever, so they use `--to=1.2.patch --accept RiskX` to be sure. The risk is irrelevant for that specific upgrade, but it persists into the future, somewhat surprisingly (at least to me).

The standalone command IMO communicates the persistence a bit better through how it is used.
> I naively expect that acceptance to be tied to that single update ... I probably would not expect that acceptance to persist into the future.
It makes sense to me too.
However, it should not be a big deal anyway. My assumption here is that the acceptance of a risk is not version-specific: you accept RiskA today, you accept it forever. If you change your mind for any reason, `oc edit` to remove it from the accepted risks.
But sure, I will count two votes (Petr and Joel) for Workflow 2.
> ```
>     risks: # deprecated by riskNames
> ```
>
> When a conditional update is accepted, the names of its associated risks are going to be merged into `clusterversion.status.history.acceptedRisks`, which is a field that already exists before this enhancement. For example, CVO's acceptance of `4.18.17` leads to `OldBootImagesPodmanMissingAuthFlag` becoming part of the value of `clusterversion.status.history.acceptedRisks`. The wording might look like "The target release ... is exposed to the risks [OldBootImagesPodmanMissingAuthFlag] and accepted by CVO because all of them are considered acceptable".
I wonder if we want to introduce a new, structured field (a list of accepted risks -- or items, see below) instead of trying to pack everything into a string.

I am thinking whether this all leads to some abstract notion of "things that an admin accepts when updating". We now have accepted risks, `--allow-explicit-upgrade`, `--allow-with-warnings`, `--force`... All of these are a kind of acceptance; maybe it warrants a structured list of items that were accepted?
Let me think about this and get back to it.
Just to make sure I understand your suggestion correctly: you want a `struct` instead of the current `string` for `status.history.acceptedRisks`, right?

```
$ oc explain clusterversion.status.history.acceptedRisks
GROUP:   config.openshift.io
KIND:    ClusterVersion
VERSION: v1

FIELD: acceptedRisks <string>

DESCRIPTION:
    acceptedRisks records risks which were accepted to initiate the update.
    For example, it may menition an Upgradeable=False or missing signature
    that was overriden via desiredUpdate.force, or an update that was
    initiated despite not being in the availableUpdates set of recommended
    update targets.
```
> You want a struct instead of the current string
More like a list of items, either strings or maybe structs:
```
acceptedByAdmin:
- type: known-issue
  riskName: AWSBug
- type: warnings
  warnings:
  - warning-one
  - warning-two
```
It's just a thought. This is a stable API, so we cannot change `status.history.acceptedRisks`, but we may start maintaining a new structured one. Probably beyond this enhancement, though.
> When a conditional update is blocked, there is a condition in `clusterversion.status.conditions` with `ReleaseAccepted=False`, e.g.,
OK, so this answers one of my earlier comments; blocking through `ReleaseAccepted=False` sounds reasonable.
Are we losing the immediate feedback that we have today? Why is it desirable for an admin to trigger an update and then have to check the cluster object to see if it is actually going to be actioned, rather than today, where they know if the upgrade command succeeds, the cluster will start moving?
It seems we are taking a backwards step in terms of UX here, no?
The problem with client-based checks is that you don't get their benefit if you `oc edit` or GitOps.

I wonder if we could change `oc adm upgrade` to change `.spec` and then wait for CVO to update the `.status` conditions and emit messages based on that. That way the server-side checks would propagate to CLI behavior.
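A rough sketch of that flow using existing tooling (`--accept` is still the proposed flag; `ReleaseAccepted` is the existing condition type, which `oc wait` can poll):

```console
# Apply the spec change, then block until CVO reports its server-side verdict.
$ oc adm upgrade --to 1.2.3 --accept RiskA
$ oc wait clusterversion/version --for=condition=ReleaseAccepted --timeout=5m
```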
How does `oc adm upgrade` get the information to make the decision? Does it source it from the CVO conditional-updates status that exists today?
Yeah, a version is either in `status.conditionalUpdates` or `status.availableUpdates`, depending on whether the cluster is exposed to a risk on that edge (the former) or not (the latter).

These status fields are populated by CVO: it gets the data from OSUS and, if a risk is present, feeds the condition PromQL through the cluster's monitoring stack, which results in placing the edge in one of these. They are populated asynchronously, not triggered by `oc adm upgrade`.
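Both fields can be inspected directly today; a quick sketch (the fields are real, and the jsonpath expressions are just one way to query them):

```console
# Versions CVO currently recommends without known risks:
$ oc get clusterversion version -o jsonpath='{.status.availableUpdates[*].version}'
# Versions with known risks, gated behind acceptance:
$ oc get clusterversion version -o jsonpath='{.status.conditionalUpdates[*].release.version}'
```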
I'm wondering, if a request to update the desired version included with it a list of risks, whether we could implement, either through CEL on the CRD itself or through a VAP, a way for it to check the same thing `oc adm upgrade` is doing, but at admission time. That would then allow `oc edit` or GitOps clients to get the same sort of functionality.

Something to investigate, perhaps?
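A minimal sketch of the VAP direction, assuming the proposed `spec.desiredUpdate.accept` field. Because `status.conditionalUpdates` is populated asynchronously, a full risk-by-risk check in CEL would be considerably more involved, so this policy only enforces that a desired update carries an explicit accept list; a ValidatingAdmissionPolicyBinding (not shown) would also be required for the policy to take effect:

```console
$ cat <<'EOF' | oc apply -f -
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: clusterversion-require-accept-list
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: ["config.openshift.io"]
      apiVersions: ["v1"]
      operations: ["UPDATE"]
      resources: ["clusterversions"]
  validations:
  - expression: >-
      !has(object.spec.desiredUpdate) ||
      has(object.spec.desiredUpdate.accept)
    message: desiredUpdate must carry an explicit (possibly empty) accept list
EOF
```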
> I wonder if we could change oc adm upgrade to change .spec and then wait for CVO to update the .status conditions and emit messages based on that.
The "wait for" part sounds a big UX leap.
CVO have server side guards already. oc-cli does not wait for any of it, correct?
> whether we could implement, either through CEL on the CRD itself, or through a VAP, a way for it to check the same thing oc adm upgrade is doing, but through admission time.
Joel, is VAP short for validating admission policy?

(I am new to both CEL and VAP.) I feel this suggestion already applies to the implementation of the oc CLI today. In that case, we do not have to block the enhancement on providing the solution. We may move to it even afterwards. Right?
Yes, apologies, I meant validating admission policy.
And yes, this was just a comment about a general UX optimisation; it just seemed relevant to this thread as an alternative.
> - Announce deprecation and support policy of the existing feature
> - Deprecate the feature
>
> ## Upgrade / Downgrade Strategy
Because this enhancement proposes deprecating existing API fields, I would expect this section to discuss how exactly the transition happens. Because we are changing an L1 API with a compatibility guarantee, maybe we would need to maintain the old fields for some time to allow potential clients (`ClusterVersion` is an API and we have no control over all its consumers) to migrate to the new shape?
Interesting.
The deprecation message has been removed from the API because the new API is gated behind a TechPreview feature gate, and there is currently no way to attach only part of the GoDoc to a feature gate. openshift/api#2360 (comment)

Assume that we want to promote/GA this feature. We could add back the deprecation message. I guess we cannot just stop populating the L1 API until OpenShift 5. So my feeling is that, in theory, the deprecation message is ONLY to encourage users to use the new field (and then be ready for OpenShift 5).

> maybe we would need to maintain the old fields for some time to allow potential clients (ClusterVersion is an API and we have no control over all its consumers) to migrate to the new shape?

I am wondering if this sort of migration is an option for us at all? Every reviewer asked about this. Maybe this is the time for me to understand it better and include it in the enhancement.
> there is no way currently to include a part of GoDoc to a feature
I don't think this would be very hard at all to add to our toolset if you think it would be useful
> I am wondering if this sort of migration is an option for us at all?
Officially you should maintain it until OpenShift 5 and set both values.
Added a sentence to make the meaning of `deprecation` clear in the new patch.
Sounds good to me
…isk" Make it clearer how '--accept ConditionalUpdateRisk' maps to a risk like NonZonalAzureMachineSetScaling getting accepted, by turning the previous: Reason: accepted NonZonalAzureMachineSetScaling into: Reason: accepted NonZonalAzureMachineSetScaling via ConditionalUpdateRisk Eventually we'll have an API that allows us to use the conditional-update risk name itself (e.g. '--accept NonZonalAzureMachineSetScaling') [1,2], but this 'via...' context will hopefully help avoid confusion in the meantime. [1]: openshift/enhancements#1807 [2]: https://issues.redhat.com/browse/OTA-1543
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewing the comments from @petr-muller, it seems there is a significant UX question about this enhancement. I agree with pretty much every one of the UX issues raised by @petr-muller, and I think we need to have some further discussion here before we commit to a particular API change.
> #### Workflow 1
>
> Provide the existing `oc adm upgrade` command with a new and optional `--accept` option to update a cluster, i.e., `--accept` takes effect only when performing a cluster update.
Any reason for `accept` vs `accept-risk`, to be more explicit that we are talking about risks?
I also saw your comment there, "see this being different between the API and the CLI", and the thumbs-up there. And here it is the CLI again. :)

I am kind of feeling that you like "accept-risks" for both the API and the CLI. Am I correct?
> See https://github.com/openshift/enhancements/blob/e044f84e9b2bafa600e6c24e35d226463c2308a5/enhancements/multi-arch/heterogeneous-architecture-clusters.md?plain=1#L282
>
> How does it affect any of the components running in the management cluster? How does it affect any components running split between the management cluster and guest cluster?
What are the HyperShift considerations? Can we fill out this section?
The current epic https://issues.redhat.com/browse/OCPSTRAT-2118 focuses on standalone OCP clusters. I know that we will need additional work for HCP for the API extension, like openshift/hypershift#1954. I honestly do not know more than that.

I am still in the process of working on the next patch to address the comments from reviews.
LGTM
The proposed API extension is in openshift/api#2360