Operator Status <Upgrade Readiness> Enhancement #23

Closed

Conversation

awgreene
Member

No description provided.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 26, 2020
@mhrivnak
Member

If an operator is in the middle of performing this critical work and dies for some reason (bug, OOM, hardware failure, etc), it's important that it be able to resume that work as soon as it can re-start. We wouldn't want to upgrade the operator in such a scenario, when OLM is not able to reach the proposed endpoint. Would you agree? If so, it's worth capturing that behavior in this proposal: OLM will not initiate an upgrade without positive confirmation that the operator is ready for it.

But then we need an escape hatch. There will be scenarios where either the operator can't function correctly until it gets updated (so the endpoint is down), or there's a bug in its implementation of this endpoint, so maybe it's always reporting not-ready-for-upgrade. There needs to be a way to force the update even when the operator either can't be reached or says it can't be updated. Any thoughts on what that mechanism would be?

@camilamacedo86 camilamacedo86 requested a review from mhrivnak May 28, 2020 13:42
@awgreene
Member Author

Hello @mhrivnak,

Thanks for the detailed review!

If an operator is in the middle of performing this critical work and dies for some reason (bug, OOM, hardware failure, etc), it's important that it be able to resume that work as soon as it can re-start. We wouldn't want to upgrade the operator in such a scenario, when OLM is not able to reach the proposed endpoint. Would you agree? If so, it's worth capturing that behavior in this proposal: OLM will not initiate an upgrade without positive confirmation that the operator is ready for it.

Agreed - I believe if an operator defines an upgrade readiness endpoint OLM should not upgrade an operator when it is unable to connect to the endpoint. I'll update the enhancement.

But then we need an escape hatch. There will be scenarios where either the operator can't function correctly until it gets updated (so the endpoint is down), or there's a bug in its implementation of this endpoint, so maybe it's always reporting not-ready-for-upgrade. There needs to be a way to force the update even when the operator either can't be reached or says it can't be updated. Any thoughts on what that mechanism would be?

There is a manual process where the user can delete the subscription/csv to uninstall the operator and reinstall the operator at the desired version.

@awgreene
Member Author

awgreene commented May 28, 2020

If an operator is in the middle of performing this critical work and dies for some reason (bug, OOM, hardware failure, etc), it's important that it be able to resume that work as soon as it can re-start. We wouldn't want to upgrade the operator in such a scenario, when OLM is not able to reach the proposed endpoint. Would you agree? If so, it's worth capturing that behavior in this proposal: OLM will not initiate an upgrade without positive confirmation that the operator is ready for it.

@mhrivnak I am also going to make sure that we call out that the Upgrade Readiness Check is opt-in, meaning OLM will default to its existing behavior in the absence of an upgrade readiness endpoint, in the interest of supporting backwards compatibility.

Member

@Bowenislandsong Bowenislandsong left a comment

I think my comments are centered around one issue, and I hope we do not have to cut a new version of the CSV. These are mostly my concerns and questions. Thanks for putting this together.

@exdx
Member

exdx commented Jun 2, 2020

I feel like at heart there is a more general issue - how does a complex component, like an operator, indicate that it's ready to be upgraded (by OLM, or any other cluster management component)? I feel like we can think about tackling this as both an upstream and downstream challenge - our solution can be a standard for different use cases.

My preference is to start simple - just a targeted annotation on the operator that indicates operator readiness. Operator authors can opt into using it by simply manipulating the annotation - adding one if the upgrade is NOT safe for some reason. OLM checks for the existence of this annotation before upgrading, otherwise proceeding as usual. I'm not sure if OLM does any existing introspection on operators it installs (that may require some design). But it's a solution that would be easy to understand by members of the community and easy to implement for operator authors.
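A minimal sketch of what that could look like, assuming a purely hypothetical annotation key (nothing here is an API that OLM defines today):

```yaml
# Hypothetical sketch only: the annotation key and values are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: example-operator-6c9f7d9b8-x2k4p
  namespace: operators
  annotations:
    # The operator would add this key to its own Pod while an upgrade is NOT
    # safe and remove it once the critical work completes; OLM would check for
    # the key before starting an upgrade.
    operators.example.com/upgrade-unsafe: "database migration in progress"
```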

@awgreene
Member Author

awgreene commented Jun 2, 2020

Thanks for the review @exdx

My preference is to start simple - just a targeted annotation on the operator that indicates operator readiness. Operator authors can opt into using it by simply manipulating the annotation - adding one if the upgrade is NOT safe for some reason.

Where is this annotation applied? It would need to be some top level resource representing the installation of the operator, likely the CSV or the Operator API that should be available soon. In either approach, we are specifically granting operators the opportunity to interact with OLM-owned resources. This could introduce unintended consequences - imagine an operator with permissions to edit a CSV escalating its RBAC privileges.

But its a solution that would be easy to understand by members of the community and easy to implement for operator authors.

While simple, this is not a well-defined standard. Wouldn't it be easier for users if we re-implemented the standards defined by Readiness Probes?

@exdx
Member

exdx commented Jun 3, 2020

Thanks for the review @exdx

My preference is to start simple - just a targeted annotation on the operator that indicates operator readiness. Operator authors can opt into using it by simply manipulating the annotation - adding one if the upgrade is NOT safe for some reason.

Where is this annotation applied? It would need to be some top level resource representing the installation of the operator, likely the CSV or the Operator API that should be available soon. In either approach, we are specifically granting operators the opportunity to interact with OLM-owned resources. This could introduce unintended consequences - imagine an operator with permissions to edit a CSV escalating its RBAC privileges.

But its a solution that would be easy to understand by members of the community and easy to implement for operator authors.

While simple, this is not a well-defined standard. Wouldn't it be easier for users if we re-implemented the standards defined by Readiness Probes?

My idea was to have the annotation on the operator pod itself - this is something we actually did when we wrote a real operator at my last job, using annotations to indicate that something was ready to change, be torn down, or be updated. This approach would require OLM to discover the annotations on the operator pods that it manages before attempting an upgrade - which wouldn't be too bad, since we know the name of the operator from the CSV and can just query the pod directly. I agree that asking the operator author to know about and modify the CSV or operator object is too specific and not a good idea.

The probe idea sounds good in principle, but what does it entail? The kubelet performs probes, not OLM. A probe describes a health check to be performed against a container to determine whether it is alive or ready to receive traffic. I don't see how we can use them for operator upgrades.

Member

@ecordell ecordell left a comment

This explores the service / probe angle well but I'm not sure the other options are given enough time to breathe. Looking more at conditions as a way to report back status could fit really well with the Operator API work, which already aggregates conditions from operator components.

I also think that we should do this with an eye towards providing more general status than just "upgradeable" - we are essentially defining the communication channel between operators and OLM, and I think it's reasonable that we / some operators will want to expand that beyond whether or not they are upgradeable.

Some ideas that may help as potential things we'd want to communicate:

  • Metrics / key prometheus queries
  • Whether the operator thinks it is in a state that can be upgraded
  • Reporting "degraded" service like we do with CVO-operators

@awgreene
Member Author

Updated to capture some more recent thoughts - still needs to incorporate all the great feedback.

@awgreene awgreene changed the title WIP: Upgrade Readiness Enhancement Operator Status <Upgrade Readiness> Enhancement Jun 16, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 16, 2020

From the presence of the `Migrating` condition, OLM will understand that it should not upgrade the operator. The Operator also communicates that the `reason` the operator cannot be upgraded is due to a "critical process" and provides a human readable `message` for cluster admins.
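For illustration only, a CR reporting such a condition might look like the following sketch; the `Foo` resource and every value shown are hypothetical:

```yaml
apiVersion: foo.example.com/v1
kind: Foo
metadata:
  name: foo-example
  namespace: default
status:
  conditions:
  # The operator sets this while the critical process is running so that OLM
  # knows not to start an upgrade.
  - type: Migrating
    status: "True"
    reason: CriticalProcess
    message: "A database migration is in progress; the operator should not be upgraded until it completes."
    lastTransitionTime: "2020-06-16T15:04:05Z"
```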

OLM will also highlight Conditions 1 and 2 within the new [Operator API](https://github.com/operator-framework/api/blob/master/crds/operators.coreos.com_operators.yaml) within the [conditions](https://github.com/operator-framework/api/blob/master/crds/operators.coreos.com_operators.yaml#L108) in the status.
Member Author

@njhale should we create an "OLM Supported Conditions" condition array?

| Field | Description |
| ------------------ | ----------- |
| Status | Status is the status of the condition. Can be True, False, Unknown. |
| Reason | Unique, one-word, CamelCase reason for the condition's last transition. |
| Message | Human-readable message indicating details about last transition. Typically used by cluster admins for debugging purposes. |
| LastTransitionTime | Last time the condition transitioned from one status to another. |


This (or a lastHeartbeatTime) might be a useful field to enforce a hard or soft grace-period. A NotInterruptible condition must be updated at least every 30 minutes, for example. If not:

  1. Soft grace-period - let the user know through OLM that the operator needs their attention.
  2. Hard grace-period - assume the process is unrecoverable and allow the upgrade to continue anyway.

Member Author

This is an interesting idea. My only concern is that an operator that does not update the condition within the hard grace-period could lead to severe consequences that OLM can't anticipate. If we decide to proceed down this route, it seems like an Operator Author should be able to opt out of the hard grace-period. Thoughts?


Allowing them to opt out is worth considering, but should probably be discouraged since it instills a false confidence. A workload being able to pick up where it left off without corrupting data is part of being a k8s workload. I see this enhancement as OLM being polite. When a node is drained, or a pod evicted, the operator will not receive the same politeness (just a normal k8s grace period before SIGKILL).

Member Author

@awgreene awgreene Jun 17, 2020

A workload being able to pick up where it left off without corrupting data is part of being a k8s workload.

+1000, unfortunately we've seen time and again that some operators fail or are unable to deliver their operator in this format.

I see this enhancement as OLM being polite. When a node is drained, or a pod evicted, the operator will not receive the same politeness (just a normal k8s grace period before SIGKILL).

I'm not convinced that this is the right approach. If operators are truly successful and used by our customers to drive their business, there could easily be a situation where OLM ignoring a critical condition of an operator causes direct revenue loss. Your suggestion might be the right approach, but I want others to share their opinions as well. CC @shawn-hurley @ecordell

Member

IMO, it would be an admin's choice to enable a hard grace-period or a soft grace-period, and they would be responsible for the risks associated with the workloads they run. I could see lots of folks being very happy having a hard grace-period while others may very much want a soft grace-period. The key here would be documenting the user choice.

I don't think that the operator author would get the choice here; they get the ability to state some things, and if the user wants to ignore them or take other actions, I don't think OLM should get in the way of that?

Also, I could be viewing this through the wrong lens?

If this is something we are debating, are we adding a user story for it or capturing it for a follow up somewhere? I don't want to lose the idea.

Member Author

IMO, it would be an admin's choice to enable a hard grace-period or a soft grace-period, and they would be responsible for the risks associated with the workloads they run. I could see lots of folks being very happy having a hard grace-period while others may very much want a soft grace-period. The key here would be documenting the user choice.
I don't think that the operator author would get the choice here; they get the ability to state some things, and if the user wants to ignore them or take other actions, I don't think OLM should get in the way of that

100%

If this is something we are debating, are we adding a user story for it or capturing it for a follow up somewhere? I don't want to lose the idea.

I can update this enhancement or create it as an epic we introduce later to keep the scope of this enhancement smaller - whichever you prefer.

Member

I think we could get something similar -- w/o putting the onus on the operator to supply the heartbeat -- by supporting "condition source" combinations; e.g. upgradeable=f(CR, pod), where perhaps the pod being ready is required for f(CR, pod)=true.


| Condition Name | Effect |
| -------------- | ------ |
| NotInterruptible | When true, OLM will not upgrade the operator. The pod associated with the operator may still be killed by external forces.|
Member

Based on the description you provided, should this be NotUpgradable instead? As you noted, the process can still be interrupted by external forces.

Also, to minimize the potential for OLM to pick up conditions that are not intended for it by the operator (false positives), would it make sense to have an OLM prefix or something for the condition? Like OLM.NotUpgradable = true

Member Author

Based on the description you provided, should this be NotUpgradable instead? As you noted, the process can still be interrupted by external forces.

I had considered naming this condition NotUpgradeable but was concerned that we might want to use this condition for additional purposes down the road. I believed that NotInterruptible gave us enough wiggle room to allow operators to express their intent by setting the condition, while giving OLM enough wiggle room to expand on the behavior as we are able to make additional guarantees. An argument against this may be changing default upgrade behavior, but I assumed that the benefit of keeping "Operator Intent" clear was worth it.

Also, to minimize the potential for OLM to pick up conditions that are not intended for it by the operator (false positives), would it make sense to have an OLM prefix or something for the condition? Like OLM.NotUpgradable = true

I believe this concern is discussed here. Basically, by requiring an operator to define the mapping there is no room for the conflict to occur.

Member

I see where you're coming from -- we want to have some wiggle room. But semantically, NotUpgradeable feels like it's more related to OLM. OLM is responsible for the upgrading of the operator. NotInterruptible feels like something a PDB would cover - by saying maxUnavailable=0. Also, maybe the operator author wants to interrupt their CR process (if it breaks) but still does not want an upgrade. By requiring them to write NotInterruptible to their status it could potentially be confusing to them.

But I don't have my heart set on it. Curious what @njhale @ecordell think on this.

As to the second point, I see. They have to explicitly set the annotation on the CRD for the status update on the CR to have any bearing. That sounds good, as long as we document that clearly.


OLM will also highlight Conditions 1 and 2 within the new [Operator API](https://github.com/operator-framework/api/blob/master/crds/operators.coreos.com_operators.yaml) within the [conditions](https://github.com/operator-framework/api/blob/master/crds/operators.coreos.com_operators.yaml#L108) in the status.

Note that by adding the `NotInterruptible` "OLM Supported Condition" to each CR, OLM will not upgrade the operator if any CR reports this condition as `true`.
Member

This can block operator upgrades if a particular CR ends up in a bad state and is unable to recover (for example, the DB migration fails). I think it's valid to block the upgrade in that case regardless, but it's worth pointing out - at that point only a manual cleanup will allow for an upgrade.


Suggestion here to help UX in this scenario: #23 (comment)

Member Author

@awgreene awgreene Jun 17, 2020

I think requiring cluster admin intervention is the safest approach, but @jupierce offered an interesting alternative.

Edit: Echo echo...

Member

I think that's a very interesting idea, to have hard and soft limits for the condition and proceeding anyway. Some thoughts around this:

  • By mandating a certain status gets updated every 30 minutes we are really starting to mandate how operator authors write their operator, which lowers adoption. Asking them to write a certain condition to their status is straightforward enough (they are doing it anyway) but saying to do it periodically is different. The logic definitely grows more complex, and I think we would be the ones to have to provide a library for them to use.
  • Conceptually, a grace period is used when something is wrong. I don't think NotInterruptible is necessarily a bad condition that an operator needs to get out of - it's a transient condition that indicates a workload is in progress and has state, like an ML training job. We would expect that eventually the workload comes out of that state and the operator goes back to an upgradable state.
  • As @awgreene mentioned, OLM terminating workloads without user input is pretty risky behavior in general. Any deletion/upgrade event is initiated by the user, either by deleting the CSV or the subscription moving to a new version. The subscription makes the process seamless, but the user still updates the catalog on the cluster with the new version, either explicitly or via catalog polling. It's really up to the operator author to make sure their operands are well behaved.

All that being said, I think your suggestion brings up a need for an example, in code, of how to set and unset the condition and best practices for it. That way operator authors have a clear picture of how to use the condition to their benefit.
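A rough sketch of what setting and later clearing such a condition could look like on a hypothetical CR status; the field values are illustrative, and flipping the status to False rather than deleting the condition is an assumed best practice, not something this enhancement prescribes:

```yaml
# While the critical work is running, the operator writes:
status:
  conditions:
  - type: NotInterruptible
    status: "True"
    reason: Migrating
    message: "Schema migration in progress; do not upgrade."
    lastTransitionTime: "2020-06-17T10:00:00Z"
---
# Once the work completes, the operator flips the status back instead of
# deleting the condition, so the last transition time stays visible:
status:
  conditions:
  - type: NotInterruptible
    status: "False"
    reason: MigrationComplete
    lastTransitionTime: "2020-06-17T10:42:00Z"
```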


### How Operators will opt-in

Operators must opt in to "OLM Supported Conditions" by adding an annotation to their owned CRD, where the key is defined by OLM and the value is the `type` of the condition on the CR.
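As a sketch of that opt-in, with a hypothetical annotation key (the exact key OLM would define is not fixed here), the owned CRD could look something like:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: foos.foo.example.com
  annotations:
    # Illustrative "OLM Supported Condition" key; the value names the
    # condition `type` OLM should look for on Foo resources.
    operators.coreos.com/olm-supported-condition.not-interruptible: NotInterruptible
spec:
  group: foo.example.com
  names:
    kind: Foo
    plural: foos
  scope: Namespaced
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
```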
Member

What if the CR that has the condition is from a required CRD, and not an owned CRD?

For example, vault has a requirement on etcd. Say the vault migration fails due to an etcd cluster in a bad state - the vault operator would write the condition to the vault CR? That seems to work. I feel like there is some additional complexity in this area that is worth looking at - are we covering all the cases?

Member Author

This is a great point and should be discussed in greater detail in this enhancement. I believe that OLM should consider if an operator can be upgraded when calculating a new Generation.

@kevinrizza
Member

/lgtm

Member

@njhale njhale left a comment

Looking really good. I left some feedback.

Member

@shawn-hurley shawn-hurley left a comment

I think that we need to add the second half of this, which is that the operator can write directly to the Probe resource.

status: false
reason: "NotUpgradeable"
message: "The operator has communicated that the operator is not upgradeable"
probeResources:
Member

This is going to get unreadable once you add three CRs, never mind 100.

Can these instead become refs to the CRs that are causing the condition to trigger?

```yaml
probeResources:
- name: foo-example
  namespace: default
  kind: Foo
  group: foo.example.com
  reason: "!Migrating"
```

I get the desire to include the entire decision; I am just worried that it will be too much for a user to read.

Member Author

That's fair - I am fine removing the list of conditions and simply including the reason.

Member Author

Given that a single resource may cause multiple OLM Supported Conditions to fire, I suggest making the reason field an array of strings.
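For example, the entry above might then read (illustrative only, reusing reason strings that already appear in this thread):

```yaml
probeResources:
- name: foo-example
  namespace: default
  kind: Foo
  group: foo.example.com
  # A single resource can trip several "OLM Supported Conditions" at once.
  reason:
  - "!Migrating"
  - "NotUpgradeable"
```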


Do we need to explain this at all? Or can we have something more opaque like relatedObjects that just points at the involved resources, and leaves it up to the included message and external docs like this enhancement to spell out the important pieces (e.g. default/foo-example has UnhealthyDatabase and BadConnectivity instead of the example's current, generic The operator has communicated that the operator is not upgradeable).
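A rough sketch of that more opaque shape, reusing names from this thread (everything here is illustrative, not a settled API):

```yaml
status: false
reason: "NotUpgradeable"
message: "default/foo-example has UnhealthyDatabase and BadConnectivity"
# Plain pointers to the involved resources; the message and external docs
# explain the details.
relatedObjects:
- group: foo.example.com
  kind: Foo
  namespace: default
  name: foo-example
```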

spec:
  specOverride: crdAnnotations # Changes to the CRD's annotations should override the spec. SPO will watch the CRD for changes to the "OLM Supported Condition" annotations and update the spec accordingly.
  probeResources:
  - resource: foo.example.com
Member Author

@awgreene awgreene Jul 17, 2020

This should be the CRD name.

@awgreene awgreene force-pushed the upgrade-readiness branch from f634b2f to 7920db6 on July 21, 2020 15:51
@ecordell
Member

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 27, 2020

A core requirement of this enhancement is to provide operators with the ability to communicate "Upgrade Readiness", signaling to OLM that they should not be upgraded at this time. Consider the following scenario based on OLM today:

>An operator installed with automatic updates is executing a critical process that could place the cluster in a bad state if interrupted. While this critical process is running, a new version of the operator becomes available. Since the operator was installed with automatic updates, OLM upgrades the operator without allowing the critical process to finish and places the cluster in an unrecoverable state.
@wking wking Aug 13, 2020

This example conflicts with the non-goal:

OLM cannot guarantee that some outside force will not interrupt the services of the operator and will not provide the operator with the ability to communicate that it should not be "Interrupted".

Personally, I prefer the non-goal. The solution to "things break when I do dangerous, irrecoverable things and get updated part-way through" is "don't do dangerous, irrecoverable things". A better example for this enhancement might be "later versions of my operator, which you might be considering updating to, are not compatible with the current cluster state. If you update me to those versions, I will break". Or something like that?

Member Author

Personally, I prefer the non-goal. The solution to "things break when I do dangerous, irrecoverable things and get updated part-way through" is "don't do dangerous, irrecoverable things".

This example is meant to highlight that OLM will not start the upgrade process while an operator is communicating that it cannot be upgraded to the next version.

A better example for this enhancement might be "later versions of my operator, which you might be considering updating to, are not compatible with the current cluster state. If you update me to those versions, I will break". Or something like that?

As @ecordell discussed, this example is slightly different than the issues we are trying to address in this enhancement.


| Condition Type | Effect |
| -------------- | ------ |
| Upgradeable | When `false`, OLM will not upgrade the operator.|

In OpenShift, we pivoted Upgradeable to only block minor+ bumps, and to no longer block patch bumps (rhbz#1797624). That allows you to gracefully recover from:

I released a broken patch which sets Upgradeable=False more aggressively than I meant, and now I want to let folks perform patch updates out of that version without having to jump through whatever hoops the overly-strict version required to clear Upgradeable=False.

However, it also relies on two assumptions:

  • Operator workloads are always interruptible for patch bumps. Which you definitely want, but this enhancement seems maybe not ready to commit to (e.g. here).
  • Patch bumps never make any changes large enough that you'd actually want to block them. Which you also want, but I have no idea how well folks SemVer their operators, if they SemVer them at all...

### Goals

- Provide operators with the means to explicitly communicate a condition that influences OLM behavior, such as the "Upgrade Readiness" condition defined above.
- Preserve existing behavior if the operator does not take advantage of this feature.
Contributor

Is there an intent here that OLM will then propagate this status to the CVO in order to block OCP upgrades under certain conditions?


Operators saying "don't update me" and "don't update the host cluster where I live" seem like two different things.

Contributor

It's somewhat orthogonal, but I bring it up because we are moving forward with documentation that says "make sure all your operators are at the latest version before you upgrade your cluster", so if we know it's physically impossible for your operators to be at the latest version for some reason, it seems worth considering whether we should also block the cluster upgrade. (In fact, even if it's possible to upgrade your operators but you haven't done so, it would also be nice to block the upgrade.)

Does not have to be covered here, but I wanted it in people's heads.

Member

@ecordell ecordell Aug 13, 2020

It's not explicit here (because this is an operator-framework enhancement), but one goal here is to allow more communication between OLM <> Operators, and in the future use that to inform the communication between OLM <> CVO and decisions around cluster upgrades. That will be spelled out in a future openshift enhancement, though.

The communication channel with OLM is valuable independent of CVO/upgrades, so we're tackling that here.


It might also be worth changing Upgradeable to OperatorUpgradeable or some such in this enhancement to distinguish between the two cases, so we don't end up with ClusterUpgradeable about "please don't bump the host system" and an unqualified Upgradeable about "please don't bump me, the operator".

Member Author

It might also be worth changing Upgradeable to OperatorUpgradeable or some such in this enhancement to distinguish between the two cases, so we don't end up with ClusterUpgradeable about "please don't bump the host system" and an unqualified Upgradeable about "please don't bump me, the operator".

I don't believe that this will be an issue given that this communication channel will exist between OLM and the operator that it manages. OLM would still communicate to CVO using the existing clusteroperator resource.


So how would OLM know which sub-operator Upgradeable=False were "hold the core, my current version is incompatible with future core APIs" and should be passed along in the ClusterOperator, and which were "don't bump me, my future versions will break on the current cluster" and should not be passed on?

Member Author

@awgreene awgreene Aug 20, 2020

The OLM team has discussed the scenarios that @wking is describing but I don't believe that we've made any final decisions.

That being said, I don't know that operators will be responsible for handling either of these scenarios. They certainly can be, but it places a lot of burden on the operators to understand the requirements of an operator that hasn't been released yet (version n of an operator must know the requirements of operator version n+1).

In the past, we had discussed having OLM make intelligent decisions when upgrading an operator by looking at available APIs and the RBAC associated with the operator. In this scenario, OLM could identify both of the cases @wking mentioned.

Again, I haven't kept up with this aspect of OLM and would welcome a comment from @ecordell if I am operating off stale info.

Member

@ecordell ecordell Aug 20, 2020

@wking that's the "other side" of this problem (and isn't something that should be addressed in this operator-framework enhancement).

This enhancement is just for operators communicating to OLM that they're in the middle of something important, which has been a long-standing problem for upgrades.

With this enhancement done, OLM could later use that to inform CVO that an operator is currently not in an interruptible state (which likely should mean reporting itself Upgradeable=false with the ability for an admin to indicate that they don't care about interrupting certain operators).

But that's not sufficient to determine whether a cluster upgrade is safe. The latest thing we've discussed on that front is that we probably want to default to OLM reporting Upgradeable=false to CVO for any operator that's installed that doesn't have an explicit "I work on OCP version X" where X is the next minor release (the mechanism of how we get that information still under some discussion) - but that will be a new enhancement in openshift/enhancements, since it's OCP-specific.


With this enhancement done, OLM could later use that to inform CVO that an operator is currently not in an interruptible state (which likely should mean reporting itself Upgradeable=false with the ability for an admin to indicate that they don't care about interrupting certain operators).

On the CVO side, Upgradeable=False only blocks minor updates, not patch level updates. If a particular workload wants to mark itself uninterruptible, or otherwise make claims about how interruptible it is, it should use pod disruption budgets. That will also protect you from things like deschedulers. So maybe the only thing that needs a new, update-specific channel is "I'm not comfortable with the host environment or other external dependencies updating right now".
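For reference, marking a workload as not tolerating voluntary disruption with a PodDisruptionBudget would look something like this (names and labels are illustrative):

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: example-workload-pdb
  namespace: default
spec:
  # Disallow voluntary disruptions (drains, descheduling) while any selected
  # pod would become unavailable.
  maxUnavailable: 0
  selector:
    matchLabels:
      app: example-workload
```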

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Aug 20, 2020
@openshift-ci-robot

New changes are detected. LGTM label has been removed.

@awgreene
Member Author

Closing this in favor of #42 with the expectation that we will eventually revisit the discussion on how to provide operators with the means to communicate complex states to OLM without writing to an OLM CRD directly.
