Don't use readiness condition to detect if updates are in progress #922

Closed
rmohr opened this issue Jun 21, 2019 · 16 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/unresolved Indicates an issue that can not or will not be resolved.

Comments

@rmohr

rmohr commented Jun 21, 2019

/kind bug

Using the readiness probe to detect if updates are in progress (by setting it to false), as described in #879, is conceptually wrong and may cause multiple issues in the long run.

Background: k8s treats not-ready components as being early in their initialization phase, i.e. not yet ready to fulfill their tasks. See https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/controller_utils.go#L741 for details (a simplified sketch follows the list below):

  • k8s treats not-ready pods as better delete targets than ready pods
  • among ready pods, k8s treats those that became ready more recently as better delete targets than those that have been ready longer
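
For illustration, here is a minimal, self-contained Go sketch of that ordering. It only mirrors the behavior described above and is not the upstream code linked; the podInfo type and pod names are made up:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// podInfo is a stand-in for the fields the controller looks at when ranking
// pods for deletion; this is a simplified sketch, not the actual upstream code.
type podInfo struct {
	name       string
	ready      bool
	readySince time.Time
}

// lessForDeletion returns true if pod a is a "better" delete target than pod b:
// not-ready pods are deleted before ready ones, and among ready pods the one
// that became ready more recently is deleted first.
func lessForDeletion(a, b podInfo) bool {
	if a.ready != b.ready {
		return !a.ready // not-ready sorts first
	}
	if a.ready && b.ready {
		return a.readySince.After(b.readySince) // recently-ready sorts first
	}
	return false
}

func main() {
	now := time.Now()
	pods := []podInfo{
		{name: "operator-new", ready: true, readySince: now.Add(-1 * time.Minute)},
		{name: "operator-old", ready: true, readySince: now.Add(-2 * time.Hour)},
		{name: "operator-updating", ready: false},
	}
	sort.Slice(pods, func(i, j int) bool { return lessForDeletion(pods[i], pods[j]) })
	// Prints the not-ready pod first, i.e. it is the preferred delete target.
	for _, p := range pods {
		fmt.Println(p.name)
	}
}
```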

Furthermore, pods are expected to become ready after some time:

  • k8s is very noisy about pods that are not ready; it treats this as an indication of a problem and will periodically create warning events

Furthermore, it is very common to discover Prometheus scrape targets via endpoints (supported by Prometheus itself). If a target is not ready, it will not be scraped (see the sketch after the list below). Consequences:

  • Alertmanager would (and should) fire alerts, since no operator instance may be ready for an extended period (updates/installs can take some time)
  • The operator would basically be unmonitored: if it gets stuck, for instance on a REST call that permanently fails, we cannot see it, whereas otherwise we would see errors piling up in the Prometheus metrics.
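
As a rough illustration of how readiness surfaces in endpoint-based discovery, here is a Go sketch using client-go that lists which addresses of a Service's Endpoints are ready vs. not ready. The namespace "default", the Service name "my-operator-metrics", and the kubeconfig path are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load a kubeconfig from the default location; adjust as needed.
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// "default" / "my-operator-metrics" stand in for the namespace and the
	// Service backing the operator's metrics endpoint.
	ep, err := client.CoreV1().Endpoints("default").Get(context.TODO(), "my-operator-metrics", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	for _, subset := range ep.Subsets {
		// Pods failing their readiness probe end up under NotReadyAddresses;
		// per the point above, endpoint-based scrape discovery only targets
		// the ready addresses.
		for _, a := range subset.Addresses {
			fmt.Println("ready, would be scraped:", a.IP)
		}
		for _, a := range subset.NotReadyAddresses {
			fmt.Println("not ready, would not be scraped:", a.IP)
		}
	}
}
```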

All of this hints that the readiness condition is being used the wrong way, which can have serious impact.

@openshift-ci-robot openshift-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jun 21, 2019
@rmohr
Author

rmohr commented Jun 21, 2019

/cc @rthallisey

@rmohr
Author

rmohr commented Jun 21, 2019

/cc @mhrivnak

@davidvossel

here's another point.

Readiness may or may not always reflect that the operand's update/deployment is in progress. It could be that there's something else preventing the pod's readiness check from passing that is specific to that operator.

Basically, it's feasible for an operator to be stuck with readiness failing for reasons other than the operand is still updating/deploying. In those cases, the operator will never get updated.

@ecordell
Member

ecordell commented Jun 21, 2019

To clarify, we are not suggesting that operators set readiness=false any time they make changes in a cluster. We are suggesting that as part of an operator update, initialization tasks should delay readiness of the new operator.

I agree that readiness probes are not the ideal long term solution. It makes it difficult to perform long-running migrations for the reasons you noted. We likely need a separate "upgrade probe" concept in OLM that can perform arbitrary checks for upgrade safety. CVO solves this problem with a push model (operators write out to a single, shared API to indicate progress) but this has its own set of problems.
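
As a rough, hypothetical sketch of that "delay readiness during initialization" pattern (the one this issue argues against), a readiness endpoint could be tied to the operator's migration state roughly like this; the /readyz path and the migrationDone flag are made up for illustration:

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
	"time"
)

// migrationDone is a hypothetical flag the operator flips once its
// initialization/upgrade tasks have finished.
var migrationDone atomic.Bool

func readyz(w http.ResponseWriter, r *http.Request) {
	if !migrationDone.Load() {
		// Reporting not-ready here is exactly the behavior this issue argues
		// against: the pod stays not-ready for the whole migration.
		http.Error(w, "migration in progress", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	go func() {
		// Placeholder for the real upgrade/migration work.
		time.Sleep(30 * time.Second)
		migrationDone.Store(true)
	}()
	http.HandleFunc("/readyz", readyz)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```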

In practice though, there are reasons why this may not be as big of an issue as you might think. There are two broad approaches to managing operands for operators:

Many Version Approach: The operator supports running many versions (potentially any version) of an operand.

For example, the etcd operator can run "any" version of etcd. Newer versions of etcd operator indicate improved operator performance and new operator features, and do not affect running etcd clusters managed by the operator.

For this type of operator, upgrade readiness may involve nothing beyond controller readiness. Data migrations are triggered by user interaction with the operator's CRs.

They publish updates under "stability" channels like alpha/beta/stable.

Tightly Coupled Approach: The operator version is tightly coupled to the operand version. Updating the operator version rolls out updates to the operands.

In this case, the operator needs to upgrade operands during an upgrade. Because of this tight coupling, there is likely a very small window of operand versions that can be "automatically" updated (for example, you may choose to allow an update from postgres 9.3.2 to 9.3.3, but probably not 9.3 to 9.4, as that can indicate significant API and behavior changes that would not be transparent to a consumer).
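
A minimal sketch of such a version gate, assuming a plain major.minor.patch scheme and that only patch-level jumps are applied automatically (sameMinor is a made-up helper):

```go
package main

import (
	"fmt"
	"strings"
)

// sameMinor reports whether two "major.minor.patch" versions differ only in
// the patch component, i.e. fall inside the narrow auto-update window
// described above. Well-formed versions are assumed for the sketch.
func sameMinor(current, target string) bool {
	c := strings.SplitN(current, ".", 3)
	t := strings.SplitN(target, ".", 3)
	if len(c) < 3 || len(t) < 3 {
		return false
	}
	return c[0] == t[0] && c[1] == t[1]
}

func main() {
	fmt.Println(sameMinor("9.3.2", "9.3.3")) // true: automatic update allowed
	fmt.Println(sameMinor("9.3.2", "9.4.0")) // false: would need an explicit channel switch
}
```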

The pattern for operators using this approach is to publish "version" channels (e.g. a "9.3" channel). In this case it is reasonable for an operator to support an ongoing migration of all operands from all previous versions of the operator. The operator signals readiness when the controller is ready, but automated updates only occur to other operators that can continue that same migration.

@rmohr
Author

rmohr commented Jun 26, 2019

I understand the two patterns but I don't understand how this is related. OLM tries to establish a pattern where not-ready means that something special is happening which should not be interrupted by OLM, whereas the whole rest of k8s sees it the other way round: this pod is not yet as valuable as all the other ones which are ready.

There are many ways to signal that updates are in progress:

  • push model
  • conditions on CRs
  • a Prometheus metric which can be directly scraped by OLM
  • annotations on the operator deployment
  • ...

all better than using the readiness probe (a rough sketch of the annotation-based option follows below). But maybe I am missing something.
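
For example, the annotation-based option could look roughly like this in Go with client-go; the annotation key, namespace, and Deployment name are made up for the sketch:

```go
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// upgradeInProgressAnnotation is a hypothetical key the operator would set on
// its own Deployment while an operand update is running.
const upgradeInProgressAnnotation = "example.com/upgrade-in-progress"

func main() {
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// "operators" / "my-operator" stand in for the operator's namespace and
	// Deployment name.
	dep, err := client.AppsV1().Deployments("operators").Get(context.TODO(), "my-operator", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	if dep.Annotations[upgradeInProgressAnnotation] == "true" {
		fmt.Println("update in progress: hold off on replacing this operator")
	} else {
		fmt.Println("no update in progress: safe to upgrade the operator")
	}
}
```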

Having a model which generates warnings by default does not seem right.

@fabiand
Contributor

fabiand commented Jun 26, 2019

We - KubeVirt and friends - consider this a serious issue.

We likely need a separate "upgrade probe" concept in OLM that can perform arbitrary checks for upgrade safety.

Do you think you could provide an alternative upgrade status probe, i.e. through some Prometheus metric (which the operator would offer)?

The issue is that we don't necessarily see problems right now, but definitely in operations, when there will be hundreds of events, alerts by Prometheus, incorrect metrics (due to kube-state-metrics) and so on, just due to a long running upgrade, which can in the end hide real issues within the cluster. And that is the real issue.

@mhrivnak
Member

@ecordell it does seem problematic that the probe will prevent prometheus from grabbing monitoring data. Any thoughts on dealing with that? I can't think of a workaround.

there will be hundreds of events, alerts by Prometheus, incorrect metrics (due to kube-state-metrics) and so on, just due to a long running upgrade

What are the limits around this monitoring setup? When do alerts start firing? I can think of some normal workloads that take minutes to start under normal circumstances. That would be more surprising for a controller, but is the monitoring controller-specific? In any case, it would help to understand more specifically what will generate alerts in this prometheus setup.

whereas the whole rest of k8s sees it the other way round: this pod is not yet as valuable as all the other ones which are ready.

While the readiness probe is used to determine if a pod can accept network traffic, and is used to prioritize which pod to delete when scaling down, I would not make broad conclusions from those two behaviors about how all of k8s should interpret the probe. It would be interesting to ask sig-apps how this use case matches with the intent of a readiness probe.

@rthallisey

Since you can have multiple deployments in a CSV file (a unified CSV file), you could create a deployment that monitors the conditions of another operator and reports readiness=false during an upgrade. This works because OLM won't upgrade any deployment from a unified CSV if any operator reports readiness=false (OLM views both objects as one operator). A "meta operator" can fulfill this responsibility.

@mhrivnak
Member

Since you can have multiple deployments in a CSV file (a unified CSV file), you could create a deployment that monitors the conditions of another operator and reports readiness=false during an upgrade. This works because OLM won't upgrade any deployment from a unified CSV if any operator reports readiness=false (OLM views both objects as one operator). A "meta operator" can fulfill this responsibility.

Interesting idea! That's a good point that in the case of OCS and CNV, the meta-operators can take this on. They just need to know how to determine whether their associated operators are themselves done upgrading.

For more traditional operators though, where there's just one Pod running most of the time, it would be a shame to require a second Pod just for this purpose.

@fabiand
Contributor

fabiand commented Jul 2, 2019 via email

@mhrivnak
Member

mhrivnak commented Oct 8, 2019

Upgrade gating should probably be avoided in most cases. We know that an operator can be interrupted at any time, and it needs to be able to start a new Reconcile on a resource and continue where it left off. In most cases, the next release of an operator should be able to continue any upgrade work the previous version was unable to complete.

We also know that upgrades will often skip directly to the newest release in the Channel, potentially skipping several releases, and thus that target release must be able to do any upgrade work that the intervening releases would have done.

In cases of upgrading to a major new release, like 2.y.z to 3.y.z, then it may be useful to drop 2.y.z upgrade logic from the 3 code base. That's a use case worth considering, but a feature like the readiness gate may not be the best tool for the job.

In a case where a definitive handoff needs to be made, a variation on leader election could be useful. A ConfigMap could serve as an upgrade lock that is held by either the operator itself or by OLM, to prevent or execute an upgrade respectively.
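
A minimal sketch of that ConfigMap lock idea, using create-if-absent semantics; the ConfigMap name, namespace, and holder annotation are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// tryAcquireUpgradeLock attempts to create a ConfigMap that acts as the
// upgrade lock; creation fails with AlreadyExists if someone else holds it.
// Namespace, name, and the holder annotation key are placeholders.
func tryAcquireUpgradeLock(client kubernetes.Interface, holder string) (bool, error) {
	lock := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:        "operator-upgrade-lock",
			Namespace:   "operators",
			Annotations: map[string]string{"example.com/holder": holder},
		},
	}
	_, err := client.CoreV1().ConfigMaps("operators").Create(context.TODO(), lock, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		return false, nil // lock already held, e.g. by the running operator
	}
	if err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	acquired, err := tryAcquireUpgradeLock(client, "olm")
	if err != nil {
		panic(err)
	}
	fmt.Println("upgrade lock acquired:", acquired)
}
```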

Exiting gracefully might be a more valuable behavior than upgrade gating for most cases. The default grace period for a Pod to terminate is 30s, which would often be plenty of time for a Reconcile function performing upgrade steps to achieve a consistent state where it can safely return. That may not be supported by controller-runtime currently, but enabling reconciles to finish gracefully after the process receives a TERM could be a valuable step toward upgrade safety.
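
A generic Go sketch of that graceful-exit idea, using plain signal handling rather than anything controller-runtime specific; the loop below stands in for the operator's reconcile work:

```go
package main

import (
	"context"
	"fmt"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// ctx is cancelled when the pod receives SIGTERM at the start of its
	// termination grace period.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM)
	defer stop()

	for {
		select {
		case <-ctx.Done():
			// Stop picking up new work; the reconcile that was already running
			// (simulated below) has been allowed to finish first.
			fmt.Println("TERM received, exiting after in-flight reconcile completed")
			return
		default:
		}
		// Placeholder for one reconcile that performs an upgrade step and
		// leaves the operand in a consistent state before returning.
		time.Sleep(2 * time.Second)
		fmt.Println("reconcile step finished")
	}
}
```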

@ecordell
Member

ecordell commented Nov 5, 2019

In a case where a definitive handoff needs to be made, a variation on leader election could be useful. A ConfigMap could serve as an upgrade lock that is held by either the operator itself or by OLM to prevent or execute upgrade respectively.

This is how the etcd-operator hands off between versions

@stale

stale bot commented Feb 26, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Feb 26, 2020
@mhrivnak
Member

This is still an important issue that stakeholders would like resolved.

@stale stale bot removed the wontfix label Feb 26, 2020
@stale

stale bot commented Apr 26, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Apr 26, 2020
@openshift-ci-robot openshift-ci-robot added the triage/unresolved Indicates an issue that can not or will not be resolved. label Apr 30, 2020
@stale stale bot removed the wontfix label Apr 30, 2020
@ecordell
Member

ecordell commented Jun 5, 2020

We are continuing discussion / design on this issue in this enhancement PR: operator-framework/enhancements#23

@ecordell ecordell closed this as completed Jun 5, 2020