Operator Status <Upgrade Readiness> Enhancement #23

Closed

Conversation

awgreene
Member

No description provided.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 26, 2020
@mhrivnak
Member

If an operator is in the middle of performing this critical work and dies for some reason (bug, OOM, hardware failure, etc), it's important that it be able to resume that work as soon as it can re-start. We wouldn't want to upgrade the operator in such a scenario, when OLM is not able to reach the proposed endpoint. Would you agree? If so, it's worth capturing that behavior in this proposal: OLM will not initiate an upgrade without positive confirmation that the operator is ready for it.

But then we need an escape hatch. There will be scenarios where either the operator can't function correctly until it gets updated (so the endpoint is down), or there's a bug in its implementation of this endpoint, so maybe it's always reporting not-ready-for-upgrade. There needs to be a way to force the update even when the operator either can't be reached or says it can't be updated. Any thoughts on what that mechanism would be?

@camilamacedo86 camilamacedo86 requested a review from mhrivnak May 28, 2020 13:42
@awgreene
Member Author

Hello @mhrivnak,

Thanks for the detailed review!

If an operator is in the middle of performing this critical work and dies for some reason (bug, OOM, hardware failure, etc), it's important that it be able to resume that work as soon as it can re-start. We wouldn't want to upgrade the operator in such a scenario, when OLM is not able to reach the proposed endpoint. Would you agree? If so, it's worth capturing that behavior in this proposal: OLM will not initiate an upgrade without positive confirmation that the operator is ready for it.

Agreed - I believe if an operator defines an upgrade readiness endpoint OLM should not upgrade an operator when it is unable to connect to the endpoint. I'll update the enhancement.

But then we need an escape hatch. There will be scenarios where either the operator can't function correctly until it gets updated (so the endpoint is down), or there's a bug in its implementation of this endpoint, so maybe it's always reporting not-ready-for-upgrade. There needs to be a way to force the update even when the operator either can't be reached or says it can't be updated. Any thoughts on what that mechanism would be?

There is a manual process where the user can delete the subscription/csv to uninstall the operator and reinstall the operator at the desired version.

@awgreene
Member Author

awgreene commented May 28, 2020

If an operator is in the middle of performing this critical work and dies for some reason (bug, OOM, hardware failure, etc), it's important that it be able to resume that work as soon as it can re-start. We wouldn't want to upgrade the operator in such a scenario, when OLM is not able to reach the proposed endpoint. Would you agree? If so, it's worth capturing that behavior in this proposal: OLM will not initiate an upgrade without positive confirmation that the operator is ready for it.

@mhrivnak I am also going to make sure that we call out that the Upgrade Readiness Check is opt-in, meaning OLM will default to its existing behavior in the absence of an upgrade readiness endpoint, in the interest of supporting backwards compatibility.

Member

@Bowenislandsong Bowenislandsong left a comment

I think my comments are centered around one issue, and I hope we do not have to cut a new version of the CSV. These are mostly my concerns and questions. Thanks for putting this together.

@exdx
Member

exdx commented Jun 2, 2020

I feel like at heart there is a more general issue - how does a complex component, like an operator, indicate that it's ready to be upgraded (by OLM, or any other cluster management component)? I feel like we can think about tackling this as both an upstream and downstream challenge - our solution can be a standard for different use cases.

My preference is to start simple - just a targeted annotation on the operator that indicates operator readiness. Operator authors can opt into using it by simply manipulating the annotation - adding one if the upgrade is NOT safe for some reason. OLM checks for the existence of this annotation before upgrading, otherwise proceeding as usual. I'm not sure if OLM does any existing introspection on operators it installs (that may require some design). But it's a solution that would be easy to understand by members of the community and easy to implement for operator authors.
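A minimal sketch of what that could look like, assuming a purely hypothetical annotation key (nothing here is an API that OLM defines today):

```yaml
# Hypothetical sketch only: the annotation key and values are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: example-operator-6c9f7d9b8-x2k4p
  namespace: operators
  annotations:
    # The operator would add this key to its own Pod while an upgrade is NOT
    # safe and remove it once the critical work completes; OLM would check for
    # the key before starting an upgrade.
    operators.example.com/upgrade-unsafe: "database migration in progress"
```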

@awgreene
Member Author

awgreene commented Jun 2, 2020

Thanks for the review @exdx

My preference is to start simple - just a targeted annotation on the operator that indicates operator readiness. Operator authors can opt into using it by simply manipulating the annotation - adding one if the upgrade is NOT safe for some reason.

Where is this annotation applied? It would need to be some top level resource representing the installation of the operator, likely the CSV or the Operator API that should be available soon. In either approach, we are specifically granting operators the opportunity to interact with OLM-owned resources. This could introduce unintended consequences - imagine an operator with permissions to edit a CSV escalating its RBAC privileges.

But its a solution that would be easy to understand by members of the community and easy to implement for operator authors.

While simple, this is not a well-defined standard. Wouldn't it be easier for users if we re-implemented the standards defined by Readiness Probes?

@exdx
Member

exdx commented Jun 3, 2020

Thanks for the review @exdx

My preference is to start simple - just a targeted annotation on the operator that indicates operator readiness. Operator authors can opt into using it by simply manipulating the annotation - adding one if the upgrade is NOT safe for some reason.

Where is this annotation applied? It would need to be some top level resource representing the installation of the operator, likely the CSV or the Operator API that should be available soon. In either approach, we are specifically granting operators the opportunity to interact with OLM-owned resources. This could introduce unintended consequences - imagine an operator with permissions to edit a CSV escalating its RBAC privileges.

But its a solution that would be easy to understand by members of the community and easy to implement for operator authors.

While simple, this is not a well-defined standard. Wouldn't it be easier for users if we re-implemented the standards defined by Readiness Probes?

My idea was to have the annotation on the operator pod itself - this is something we actually did when we wrote a real operator at my last job, using annotations to indicate that something was ready to change, be torn down, or be updated. This approach would require OLM to discover the annotations on the operator pods that it manages before attempting an upgrade - which wouldn't be too bad, since we know the name of the operator from the CSV and can just query the pod directly. I agree that asking the operator author to know about and modify the CSV or operator object is too specific and not a good idea.

The probe idea sounds good in principle, but what does it entail? The kubelet performs probes, not OLM. A probe describes a health check to be performed against a container to determine whether it is alive or ready to receive traffic. I don't see how we can use them for operator upgrades.

Member

@ecordell ecordell left a comment

This explores the service / probe angle well but I'm not sure the other options are given enough time to breathe. Looking more at conditions as a way to report back status could fit really well with the Operator API work, which already aggregates conditions from operator components.

I also think that we should do this with an eye towards providing more general status than just "upgradeable" - we are essentially defining the communication channel between operators and OLM, and I think it's reasonable that we / some operators will want to expand that beyond whether or not they are upgradeable.

Some ideas that may help as potential things we'd want to communicate:

  • Metrics / key prometheus queries
  • Whether the operator thinks it is in a state that can be upgraded
  • Reporting "degraded" service like we do with CVO-operators

@awgreene
Member Author

Updated to capture some more recent thoughts - still needs to incorporate all the great feedback.

@awgreene awgreene changed the title WIP: Upgrade Readiness Enhancement Operator Status <Upgrade Readiness> Enhancement Jun 16, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 16, 2020

From the presence of the `Migrating` condition, OLM will understand that it should not upgrade the operator. The Operator also communicates that the `reason` the operator cannot be upgraded is due to a "critical process" and provides a human readable `message` for cluster admins.
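For illustration only, a CR reporting such a condition might look like the following sketch; the `Foo` resource and every value shown are hypothetical:

```yaml
apiVersion: foo.example.com/v1
kind: Foo
metadata:
  name: foo-example
  namespace: default
status:
  conditions:
  # The operator sets this while the critical process is running so that OLM
  # knows not to start an upgrade.
  - type: Migrating
    status: "True"
    reason: CriticalProcess
    message: "A database migration is in progress; the operator should not be upgraded until it completes."
    lastTransitionTime: "2020-06-16T15:04:05Z"
```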

OLM will also highlight Conditions 1 and 2 within the new [Operator API](https://github.com/operator-framework/api/blob/master/crds/operators.coreos.com_operators.yaml) within the [conditions](https://github.com/operator-framework/api/blob/master/crds/operators.coreos.com_operators.yaml#L108) in the status.
Member Author

@njhale should we create an "OLM Supported Conditions" condition array?

| Field | Description |
| ------------------ | ----------- |
| Status | Status is the status of the condition. Can be True, False, Unknown. |
| Reason | Unique, one-word, CamelCase reason for the condition's last transition. |
| Message | Human-readable message indicating details about last transition. Typically used by cluster admins for debugging purposes. |
| LastTransitionTime | Last time the condition transitioned from one status to another. |


This (or a lastHeartbeatTime) might be a useful field to enforce a hard or soft grace-period. A NotInterruptible condition must be updated at least every 30 minutes, for example. If not:

  1. Soft grace-period - let the user know through OLM that the operator needs their attention.
  2. Hard grace-period - assume the process is unrecoverable and allow the upgrade to continue anyway.

Member Author

This is an interesting idea. My only concern is that an operator that does not update the condition within the hard grace-period could lead to severe consequences that OLM can't anticipate. If we decide to proceed down this route, it seems like an Operator Author should be able to opt out of the hard grace-period. Thoughts?


Allowing them to opt out is worth considering, but should probably be discouraged since it instills a false confidence. A workload being able to pick up where it left off without corrupting data is part of being a k8s workload. I see this enhancement as OLM being polite. When a node is drained, or a pod evicted, the operator will not receive the same politeness (just a normal k8s grace period before SIGKILL).

Member Author

@awgreene awgreene Jun 17, 2020

A workload being able to pick up where it left off without corrupting data is part of being a k8s workload.

+1000, unfortunately we've seen time and again that some operators fail or are unable to deliver their operator in this format.

I see this enhancement as OLM being polite. When a node is drained, or a pod evicted, the operator will not receive the same politeness (just a normal k8s grace period before SIGKILL).

I'm not convinced that this is the right approach. If operators are truly successful and used by our customers to drive their business, there could easily be a situation where OLM ignoring a critical condition of an operator causes direct revenue loss. Your suggestion might be the right approach, but I want others to share their opinions as well. CC @shawn-hurley @ecordell

Member

IMO, it would be an admin's choice to enable a hard grace-period or a soft grace-period, and they would be responsible for the risks associated with the workloads they run. I could see lots of folks being very happy having a hard grace-period while others may very much want a soft grace-period. The key here would be documenting the user choice.

I don't think that the operator author would get the choice here; they get the ability to state some things, and if the user wants to ignore them or take other actions, I don't think OLM should get in the way of that?

Also, I could be viewing this through the wrong lens?

If this is something we are debating, are we adding a user story for it or capturing it for a follow up somewhere? I don't want to lose the idea.

Member Author

IMO, it would be an admin's choice to enable a hard grace-period or a soft grace-period, and they would be responsible for the risks associated with the workloads they run. I could see lots of folks being very happy having a hard grace-period while others may very much want a soft grace-period. The key here would be documenting the user choice.
I don't think that the operator author would get the choice here; they get the ability to state some things, and if the user wants to ignore them or take other actions, I don't think OLM should get in the way of that

100%

If this is something we are debating, are we adding a user story for it or capturing it for a follow up somewhere? I don't want to lose the idea.

I can update this enhancement or create it as an epic we introduce later to keep the scope of this enhancement smaller - whichever you prefer.

Member

I think we could get something similar -- w/o putting the onus on the operator to supply the heartbeat -- by supporting "condition source" combinations; e.g. upgradeable=f(CR, pod), where perhaps the pod being ready is required for f(CR, pod)=true.


| Condition Name | Effect |
| -------------- | ------ |
| NotInterruptible | When true, OLM will not upgrade the operator. The pod associated with the operator may still be killed by external forces.|
Member

Based on the description you provided, should this be NotUpgradable instead? As you noted, the process can still be interrupted by external forces.

Also, to minimize the potential for OLM to pick up conditions that are not intended for it by the operator (false positives), would it make sense to have an OLM prefix or something for the condition? Like OLM.NotUpgradable = true

Member Author

Based on the description you provided, should this be NotUpgradable instead? As you noted, the process can still be interrupted by external forces.

I had considered naming this condition NotUpgradeable but was concerned that we might want to use this condition for additional purposes down the road. I believed that NotInterruptible gave us enough wiggle room to allow operators to express their intent by setting the condition, while giving OLM enough wiggle room to expand on the behavior as we are able to make additional guarantees. An argument against this may be changing default upgrade behavior, but I assumed that the benefit of keeping "Operator Intent" clear was worth it.

Also, to minimize the potential for OLM to pick up conditions that are not intended for it by the operator (false positives), would it make sense to have an OLM prefix or something for the condition? Like OLM.NotUpgradable = true

I believe this concern is discussed here. Basically, by requiring an operator to define the mapping there is no room for the conflict to occur.

Member

I see where you're coming from -- we want to have some wiggle room. But semantically, NotUpgradeable feels like it's more related to OLM. OLM is responsible for the upgrading of the operator. NotInterruptible feels like something a PDB would cover - by saying maxUnavailable=0. Also, maybe the operator author wants to interrupt their CR process (if it breaks) but still does not want an upgrade. By requiring them to write NotInterruptible to their status it could potentially be confusing to them.

But I don't have my heart set on it. Curious what @njhale @ecordell think on this.

As to the second point, I see. They have to explicitly set the annotation on the CRD for the status update on the CR to have any bearing. That sounds good, as long as we document that clearly.


OLM will also highlight Conditions 1 and 2 within the new [Operator API](https://github.com/operator-framework/api/blob/master/crds/operators.coreos.com_operators.yaml) within the [conditions](https://github.com/operator-framework/api/blob/master/crds/operators.coreos.com_operators.yaml#L108) in the status.

Note that by adding the `NotInterruptible` "OLM Supported Condition" to each CR, OLM will not upgrade the operator if any CR reports this condition as `true`.
Member

This can block operator upgrades if a particular CR ends up in a bad state and is unable to recover (for example, the DB migration fails). I think it's valid to block the upgrade in that case regardless, but it's worth pointing out - at that point only a manual cleanup will allow for an upgrade.


Suggestion here to help UX in this scenario: #23 (comment)

Member Author

@awgreene awgreene Jun 17, 2020

I think requiring cluster admin intervention is the safest approach, but @jupierce offered an interesting alternative.

Edit: Echo echo...

Member

I think that's a very interesting idea, to have hard and soft limits for the condition and proceeding anyway. Some thoughts around this:

  • By mandating a certain status gets updated every 30 minutes we are really starting to mandate how operator authors write their operator, which lowers adoption. Asking them to write a certain condition to their status is straightforward enough (they are doing it anyway) but saying to do it periodically is different. The logic definitely grows more complex, and I think we would be the ones to have to provide a library for them to use.
  • Conceptually, a grace period is used when something is wrong. I don't think NotInterruptible is necessarily a bad condition that an operator needs to get out of - it's a transient condition that indicates a workload is in progress and has state, like an ML training job. We would expect that eventually the workload comes out of that state and the operator goes back to an upgradable state.
  • As @awgreene mentioned, OLM terminating workloads without user input is pretty risky behavior in general. Any deletion/upgrade event is initiated by the user, either by deleting the CSV or the subscription moving to a new version. The subscription makes the process seamless, but the user still updates the catalog on the cluster with the new version, either explicitly or via catalog polling. It's really up to the operator author to make sure their operands are well behaved.

All that being said, I think your suggestion brings up a need for an example, in code, of how to set and unset the condition and best practices for it. That way operator authors have a clear picture of how to use the condition to their benefit.
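A rough sketch of what setting and later clearing such a condition could look like on a hypothetical CR status; the field values are illustrative, and flipping the status to False rather than deleting the condition is an assumed best practice, not something this enhancement prescribes:

```yaml
# While the critical work is running, the operator writes:
status:
  conditions:
  - type: NotInterruptible
    status: "True"
    reason: Migrating
    message: "Schema migration in progress; do not upgrade."
    lastTransitionTime: "2020-06-17T10:00:00Z"
---
# Once the work completes, the operator flips the status back instead of
# deleting the condition, so the last transition time stays visible:
status:
  conditions:
  - type: NotInterruptible
    status: "False"
    reason: MigrationComplete
    lastTransitionTime: "2020-06-17T10:42:00Z"
```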


### How Operators will opt-in

Operators must opt in to "OLM Supported Conditions" by adding an annotation to their owned CRD, where the key is defined by OLM and the value is the `type` of the condition on the CR.
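As a sketch of that opt-in, with a hypothetical annotation key (the exact key OLM would define is not fixed here), the owned CRD could look something like:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: foos.foo.example.com
  annotations:
    # Illustrative "OLM Supported Condition" key; the value names the
    # condition `type` OLM should look for on Foo resources.
    operators.coreos.com/olm-supported-condition.not-interruptible: NotInterruptible
spec:
  group: foo.example.com
  names:
    kind: Foo
    plural: foos
  scope: Namespaced
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
```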
Member

What if the CR that has the condition is from a required CRD, and not an owned CRD?

For example, vault has a requirement on etcd. Say the vault migration fails due to an etcd cluster in a bad state - the vault operator would write the condition to the vault CR? That seems to work. I feel like there is some additional complexity in this area that is worth looking at - are we covering all the cases?

Member Author

This is a great point and should be discussed in greater detail in this enhancement. I believe that OLM should consider if an operator can be upgraded when calculating a new Generation.

@kevinrizza
Member

/lgtm

Member

@njhale njhale left a comment

Looking really good. I left some feedback.

Member

@shawn-hurley shawn-hurley left a comment

I think that we need to add the second half of this, which is that the operator can write directly to the Probe resource.

status: false
reason: "NotUpgradeable"
message: "The operator has communicated that the operator is not upgradeable"
probeResources:
Member

This is going to get unreadable once you add three CRs, never mind 100.

Can these instead become refs to the CRs that are causing the condition to trigger?

```yaml
probeResources:
- name: foo-example
  namespace: default
  kind: Foo
  group: foo.example.com
  reason: "!Migrating"
```

I get the desire to include the entire decision; I am just worried that it will be too much for a user to read.

Member Author

That's fair - I am fine removing the list of conditions and simply including the reason.

Member Author

Given that a single resource may cause multiple OLM Supported Conditions to fire, I suggest making the reason field an array of strings.
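For example, the entry above might then read (illustrative only, reusing reason strings that already appear in this thread):

```yaml
probeResources:
- name: foo-example
  namespace: default
  kind: Foo
  group: foo.example.com
  # A single resource can trip several "OLM Supported Conditions" at once.
  reason:
  - "!Migrating"
  - "NotUpgradeable"
```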


Do we need to explain this at all? Or can we have something more opaque like relatedObjects that just points at the involved resources, and leaves it up to the included message and external docs like this enhancement to spell out the important pieces (e.g. default/foo-example has UnhealthyDatabase and BadConnectivity instead of the example's current, generic The operator has communicated that the operator is not upgradeable).
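A rough sketch of that more opaque shape, reusing names from this thread (everything here is illustrative, not a settled API):

```yaml
status: false
reason: "NotUpgradeable"
message: "default/foo-example has UnhealthyDatabase and BadConnectivity"
# Plain pointers to the involved resources; the message and external docs
# explain the details.
relatedObjects:
- group: foo.example.com
  kind: Foo
  namespace: default
  name: foo-example
```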

spec:
  specOverride: crdAnnotations # Changes to the CRD's annotations should override the spec. SPO will watch the CRD for changes to the "OLM Supported Condition" annotations and update the spec accordingly.
  probeResources:
  - resource: foo.example.com
Member Author

@awgreene awgreene Jul 17, 2020

This should be the CRD name.

@awgreene awgreene force-pushed the upgrade-readiness branch from f634b2f to 7920db6 on July 21, 2020 15:51
@ecordell
Member

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 27, 2020

A core requirement of this enhancement is to provide operators with the ability to communicate "Upgrade Readiness", signaling to OLM that they should not be upgraded at this time. Consider the following scenario based on OLM today:

>An operator installed with automatic updates is executing a critical process that could place the cluster in a bad state if interrupted. While this critical process is running, a new version of the operator becomes available. Since the operator was installed with automatic updates, OLM upgrades the operator without allowing the critical process to finish and places the cluster in an unrecoverable state.
@wking wking Aug 13, 2020

This example conflicts with the non-goal:

OLM cannot guarantee that some outside force will not interrupt the services of the operator and will not provide the operator with the ability to communicate that it should not be "Interrupted".

Personally, I prefer the non-goal. The solution to "things break when I do dangerous, irrecoverable things and get updated part-way through" is "don't do dangerous, irrecoverable things". A better example for this enhancement might be "later versions of my operator, which you might be considering updating to, are not compatible with the current cluster state. If you update me to those versions, I will break". Or something like that?

Member Author

Personally, I prefer the non-goal. The solution to "things break when I do dangerous, irrecoverable things and get updated part-way through" is "don't do dangerous, irrecoverable things".

This example is meant to highlight that OLM will not start the upgrade process while an operator is communicating that it cannot be upgraded to the next version.

A better example for this enhancement might be "later versions of my operator, which you might be considering updating to, are not compatible with the current cluster state. If you update me to those versions, I will break". Or something like that?

As @ecordell discussed, this example is slightly different than the issues we are trying to address in this enhancement.


| Condition Type | Effect |
| -------------- | ------ |
| Upgradeable | When `false`, OLM will not upgrade the operator.|

In OpenShift, we pivoted Upgradeable to only block minor+ bumps, and to no longer block patch bumps (rhbz#1797624). That allows you to gracefully recover from:

I released a broken patch which sets Upgradeable=False more aggressively than I meant, and now I want to let folks perform patch updates out of that version without having to jump through whatever hoops the overly-strict version required to clear Upgradeable=False.

However, it also relies on two assumptions:

  • Operator workloads are always interruptible for patch bumps. Which you definitely want, but this enhancement seems maybe not ready to commit to (e.g. here).
  • Patch bumps never make any changes large enough that you'd actually want to block them. Which you also want, but I have no idea how well folks SemVer their operators, if they SemVer them at all...

### Goals

- Provide operators with the means to explicitly communicate a condition that influences OLM behavior, such as the "Upgrade Readiness" condition defined above.
- Preserve existing behavior if the operator does not take advantage of this feature.
Contributor

Is there an intent here that OLM will then propagate this status to the CVO in order to block OCP upgrades under certain conditions?


Operators saying "don't update me" and "don't update the host cluster where I live" seem like two different things.

Contributor

It's somewhat orthogonal, but I bring it up because we are moving forward with documentation that says "make sure all your operators are at the latest version before you upgrade your cluster", so if we know it's physically impossible for your operators to be at the latest version for some reason, it seems worth considering whether we should also block the cluster upgrade. (In fact, even if it's possible to upgrade your operators but you haven't done so, it would also be nice to block the upgrade.)

Does not have to be covered here, but I wanted it in people's heads.

Member

@ecordell ecordell Aug 13, 2020

It's not explicit here (because this is an operator-framework enhancement), but one goal here is to allow more communication between OLM <> Operators, and in the future use that to inform the communication between OLM <> CVO and decisions around cluster upgrades. That will be spelled out in a future openshift enhancement, though.

The communication channel with OLM is valuable independent of CVO/upgrades, so we're tackling that here.


It might also be worth changing Upgradeable to OperatorUpgradeable or some such in this enhancement to distinguish between the two cases, so we don't end up with ClusterUpgradeable about "please don't bump the host system" and an unqualified Upgradeable about "please don't bump me, the operator".

Member Author

It might also be worth changing Upgradeable to OperatorUpgradeable or some such in this enhancement to distinguish between the two cases, so we don't end up with ClusterUpgradeable about "please don't bump the host system" and an unqualified Upgradeable about "please don't bump me, the operator".

I don't believe that this will be an issue given that this communication channel will exist between OLM and the operator that it manages. OLM would still communicate to CVO using the existing clusteroperator resource.


So how would OLM know which sub-operator Upgradeable=False were "hold the core, my current version is incompatible with future core APIs" and should be passed along in the ClusterOperator, and which were "don't bump me, my future versions will break on the current cluster" and should not be passed on?

Member Author

@awgreene awgreene Aug 20, 2020

The OLM team has discussed the scenarios that @wking is describing but I don't believe that we've made any final decisions.

That being said, I don't know that operators will be responsible for handling either of these scenarios. They certainly can be, but it places a lot of burden on the operators to understand the requirements of an operator that hasn't been released yet (version n of an operator must know the requirements of operator version n+1).

In the past, we had discussed having OLM make intelligent decisions when upgrading an operator by looking at available APIs and the RBAC associated with the operator. In this scenario, OLM could identify both of the cases @wking mentioned.

Again, I haven't kept up with this aspect of OLM and would welcome a comment from @ecordell if I am operating off stale info.

Member

@ecordell ecordell Aug 20, 2020

@wking that's the "other side" of this problem (and isn't something that should be addressed in this operator-framework enhancement).

This enhancement is just for operators communicating to OLM that they're in the middle of something important, which has been a long-standing problem for upgrades.

With this enhancement done, OLM could later use that to inform CVO that an operator is currently not in an interruptible state (which likely should mean reporting itself Upgradeable=false with the ability for an admin to indicate that they don't care about interrupting certain operators).

But that's not sufficient to determine whether a cluster upgrade is safe. The latest thing we've discussed on that front is that we probably want to default to OLM reporting Upgradeable=false to CVO for any operator that's installed that doesn't have an explicit "I work on OCP version X" where X is the next minor release (the mechanism of how we get that information still under some discussion) - but that will be a new enhancement in openshift/enhancements, since it's OCP-specific.


With this enhancement done, OLM could later use that to inform CVO that an operator is currently not in an interruptible state (which likely should mean reporting itself Upgradeable=false with the ability for an admin to indicate that they don't care about interrupting certain operators).

On the CVO side, Upgradeable=False only blocks minor updates, not patch level updates. If a particular workload wants to mark itself uninterruptible, or otherwise make claims about how interruptible it is, it should use pod disruption budgets. That will also protect you from things like deschedulers. So maybe the only thing that needs a new, update-specific channel is "I'm not comfortable with the host environment or other external dependencies updating right now".
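For reference, marking a workload as not tolerating voluntary disruption with a PodDisruptionBudget would look something like this (names and labels are illustrative):

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: example-workload-pdb
  namespace: default
spec:
  # Disallow voluntary disruptions (drains, descheduling) while any selected
  # pod would become unavailable.
  maxUnavailable: 0
  selector:
    matchLabels:
      app: example-workload
```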

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Aug 20, 2020
@openshift-ci-robot

New changes are detected. LGTM label has been removed.

@awgreene
Member Author

Closing this in favor of #42 with the expectation that we will eventually revisit the discussion on how to provide operators with the means to communicate complex states to OLM without writing to an OLM CRD directly.
