Upgrade Readiness Enhancement

awgreene · awgreene · commit 001557158ed7 · 2020-05-26T18:48:04.000-04:00
diff --git a/enhancements/upgrade-readiness.md b/enhancements/upgrade-readiness.md
@@ -0,0 +1,291 @@
+---
+title: operator-status
+authors:
+  - "@awgreene"
+reviewers:
+  - "@ecordell"
+approvers:
+  - "@kevinrizza"
+creation-date: 2020-05-19
+last-updated: yyyy-mm-dd
+status: implementable
+see-also:
+  - "N/A"  
+replaces:
+  - "N/A"
+superseded-by:
+  - "N/A"
+---
+
+# operator-status
+
+## Release Signoff Checklist
+
+- [x] Enhancement is `implementable`
+- [ ] Design details are appropriately documented from clear requirements
+- [ ] Test plan is defined
+- [ ] Graduation criteria for dev preview, tech preview, GA
+
+## Summary
+
+Managing the upgrade of an operator is a core feature of the [Operator Lifecycle Manager (OLM) project](https://github.com/operator-framework/operator-lifecycle-manager). There are instances where an operator is performing a long-running, critical task during which it should not be interrupted by an upgrade initiated by OLM. OLM must provide operators with the means to communicate whether or not they are ready to be upgraded.
+
+## Motivation
+
+Today, OLM does not provide operators with the ability to block upgrades while the operator completes a critical process. Consider the following example:
+
+An operator has three versions available, v1, v2, and v3, where v2 has a long-running process that migrates resources from the v1 schema to the new v2 schema. Today, if operator v1 were installed with automatic updates enabled, and v2 and v3 became available at the same time, OLM would upgrade the operator from v1 to v2 to v3 without allowing v2's migration process to run.
+
+The scenario above could place the operator in an unrecoverable state. OLM must provide the operators that it manages with a method to communicate whether or not they are ready to be upgraded.
+
+### Goals
+
+- Provide operators with the means to communicate whether or not they should be upgraded.
+- Update OLM so operators are not upgraded if they communicate that they should not be upgraded.
+- Preserve existing behavior should the operator not communicate its upgradeable status.
+
+### Non-Goals
+
+- Support operator version rollback.
+- Provide operators with the means to cancel an upgrade that is in progress.
+
+## Proposal
+
+### User Stories
+
+#### Story 1
+
+As an operator author, I want my operator to be able to communicate that OLM should not attempt to upgrade my operator until I remove the flag.
+
+### Implementation Details/Notes/Constraints [optional]
+
+Operators can communicate upgrade readiness by exposing an `upgradeReadiness` endpoint. This endpoint can communicate readiness by:
+
+- printing 0, indicating that the operator cannot be upgraded at this time.
+- printing 1, indicating that the operator is ready to be upgraded.
+
+This endpoint will be exposed via a [Service resource](https://kubernetes.io/docs/concepts/services-networking/service/#service-resource) that OLM will probe for upgrade readiness prior to upgrading the operator to the next version.
+
+This feature must be supported by both CSV and CSVless bundles.
+
+#### CSV Bundles
+
+OLM's [ClusterServiceVersion](https://github.com/operator-framework/api/blob/197407cd70e8ddfef85d21216085ed52fbb4bb2d/pkg/operators/v1alpha1/clusterserviceversion_types.go#L500) type will be updated to include an `upgradeReadinessServices` field for each [StrategyDeploymentSpec](https://github.com/operator-framework/api/blob/197407cd70e8ddfef85d21216085ed52fbb4bb2d/pkg/operators/v1alpha1/clusterserviceversion_types.go#L63).
+
+There are two options for defining an upgrade readiness Service in the CSV.
+
+##### Approach 1: Define the service
+
+As a developer, you may wish to include the readiness service in the CSV for rapid iteration. This can be accomplished by setting the `spec` of each entry in `upgradeReadinessServices` in the CSV.
+
+```yaml
+apiVersion: operators.coreos.com/v1alpha1
+kind: ClusterServiceVersion
+metadata:
+  name: foo-operator.v1.0.0
+  namespace: olm
+spec:
+  description: |
+    An example operator
+  displayName: foo-operator
+  install:
+    spec:
+      deployments:
+      - name: foo-operator
+        upgradeReadinessServices:              # An array of readiness services
+        - name: foo-operator-readiness-service # The Service name
+          spec:                                # Standard Service spec
+            selector:                            
+              app: foo-operator
+            ports:
+            - protocol: TCP
+              port: 80
+              targetPort: 9376
+        spec:
+          replicas: 1
+          selector:
+            matchLabels:
+              app: foo-operator
+          template:
+            metadata:
+              labels:
+                app: foo-operator
+            spec:
+              containers:
+              - name: foo-operator
+                image: foo
+                command:
+                  - sleep
+                  - "3600"
+    strategy: deployment
+  installModes:
+  - supported: false
+    type: OwnNamespace
+  - supported: true
+    type: SingleNamespace
+  - supported: true
+    type: MultiNamespace
+  - supported: true
+    type: AllNamespaces
+  maturity: alpha
+  provider:
+    name: Red Hat
+  version: 1.0.0
+```
+
+##### Approach 2: Omitting the Upgradeable Readiness Service Spec
+
+This approach allows operator authors to provide the name of an upgrade readiness Service that should exist on the cluster. When using this approach, the Service will typically be present in the operator bundle.
+
+First, you must create a CSV that omits the spec of the upgradeReadinessService:
+
+```yaml
+apiVersion: operators.coreos.com/v1alpha1
+kind: ClusterServiceVersion
+metadata:
+  name: foo-operator.v1.0.0
+  namespace: olm
+spec:
+  description: |
+    An example operator
+  displayName: foo-operator
+  install:
+    spec:
+      deployments:
+      - name: foo-operator
+        upgradeReadinessServices:              # An array of upgradeReadiness services
+        - name: foo-operator-readiness-service # The name of the Service
+        spec:
+          replicas: 1
+          selector:
+            matchLabels:
+              app: foo-operator
+          template:
+            metadata:
+              labels:
+                app: foo-operator
+            spec:
+              containers:
+              - name: foo-operator
+                image: foo
+                command:
+                  - sleep
+                  - "3600"
+    strategy: deployment
+  installModes:
+  - supported: false
+    type: OwnNamespace
+  - supported: true
+    type: SingleNamespace
+  - supported: true
+    type: MultiNamespace
+  - supported: true
+    type: AllNamespaces
+  maturity: alpha
+  provider:
+    name: Red Hat
+  version: 1.0.0
+```
+
+The operator bundle should include the following Service:
+
+```yaml
+apiVersion: v1
+kind: Service
+metadata:
+  name: foo-operator-readiness-service
+spec:
+  selector:
+    app: foo-operator
+  ports:
+    - protocol: TCP
+      port: 80
+      targetPort: 9376
+```
+
+#### CSVless Bundles
+
+In the case of a CSVless bundle, an annotation can be added to `annotations.yaml` that indicates the deployment name and the name of the upgradeReadinessService:
+
+```yaml
+  operators.operatorframework.io.deployment.<DeploymentName>.readiness.service: "<ServiceName>"
+```
+
+Once the `annotations.yaml` is created with the key/value above, the following resources should appear in the operator bundle:
+
+- A Service whose name matches the `<ServiceName>` in your annotation and that routes to the readiness endpoint.
+- A Deployment
+  - Whose name matches the `<DeploymentName>` used in your annotation
+  - Whose pod template has labels that are selected by the Service's selector
+
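+As a hedged illustration of the requirements above, assume a Deployment named `foo-operator` and the `foo-operator-readiness-service` Service shown earlier. Both names, the container port, and the placement of the key under the top-level `annotations:` map of the bundle's `annotations.yaml` are assumptions made for this sketch:
+
+```yaml
+# annotations.yaml (sketch): <DeploymentName> and <ServiceName> filled in
+annotations:
+  operators.operatorframework.io.deployment.foo-operator.readiness.service: "foo-operator-readiness-service"
+---
+# Deployment shipped in the same bundle (sketch)
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: foo-operator           # Matches <DeploymentName> in the annotation key
+spec:
+  replicas: 1
+  selector:
+    matchLabels:
+      app: foo-operator
+  template:
+    metadata:
+      labels:
+        app: foo-operator      # Selected by the readiness Service's selector
+    spec:
+      containers:
+      - name: foo-operator
+        image: foo
+        ports:
+        - containerPort: 9376  # Assumed to match the Service's targetPort
+```
+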
+### Risks and Mitigations
+
+- Not all operators may take advantage of this feature. If an operator does not opt in to this feature, OLM will fall back on existing behavior when checking if an operator can be upgraded.
+
+- An operator may opt in to this feature but never report itself as "upgradeable". In these situations, cluster admins must uninstall the operator and upgrade the operator manually.
+
+- If OLM reports metrics to Telemeter regarding whether or not an operator is upgradeable, it reports an unbounded metric. An unbounded metric may overwhelm Telemeter.
+
+## Design Details
+
+### Test Plan
+
+OLM's e2e testing suite will be updated to cover the following use cases:
+
+- If a CSV references, without a spec, an upgradeReadiness Service that does not exist, upgrades for that operator are blocked.
+- If a CSVless bundle defines an upgradeReadiness Service that does not exist, the bundle fails to build.
+- If a CSV defines an upgradeReadiness Service with a spec, changes to the service are reverted by OLM.
+- If a CSV/CSVless bundle defines an upgradeReadiness Service that returns 0, upgrades to the operator are blocked. The operator then upgrades once the service returns 1.
+
+Additionally, tests that use CSVs that omit the upgradeReadiness Service must continue to adhere to existing behavior.
+
+
+## Implementation History
+
+Major milestones in the life cycle of a proposal should be tracked in `Implementation
+History`.
+
+## Alternatives
+
+There are viable alternatives to the solution proposed in this enhancement.
+
+### Substitute the UpgradeReadinessService with an UpdateProbe
+
+It is possible that we could take advantage of the standard defined by [Liveness, Readiness, and Startup Probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/) by importing the [Prober](https://github.com/kubernetes/kubernetes/blob/release-1.18/pkg/kubelet/prober/prober.go#L48) code. The pros and cons of this approach are listed below.
+
+**This seems to be the most viable alternative.**
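+
+A rough sketch of how this alternative might look in a CSV, assuming a hypothetical `updateProbe` field on each deployment entry that reuses the Kubernetes probe schema. The field name, path, port, and period are illustrative only and are not part of any existing OLM API:
+
+```yaml
+# Fragment of a CSV's install strategy. `updateProbe` is a hypothetical field.
+install:
+  spec:
+    deployments:
+    - name: foo-operator
+      updateProbe:                 # Hypothetical; mirrors the Kubernetes probe schema
+        httpGet:
+          path: /upgradeReadiness  # Assumed path for the readiness endpoint
+          port: 9376               # Assumed container port
+        periodSeconds: 10          # Assumed probe interval
+      # The Deployment spec follows, as in the CSV examples above
+```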
+
+#### Pros
+
+- Follows a well-defined probe format which supports [HTTPGet, Exec, and TCPSocket](https://github.com/kubernetes/api/blob/6f652b6ce59c386f4a431eb031d0339620aaff5e/core/v1/types.go#L2279-L2292) options.
+- Enables OLM to identify which Pod is reporting an upgradeReadiness failure, as a service could route to one of any number of pods.
+
+#### Cons
+
+- Requires OLM to import and maintain the prober code that executes the probe.
+
+### Create an API so Operators may report Status
+
+OLM could introduce a new API that allows operators to report their upgradeReadiness.
+
+#### Pros
+
+- Provides a single source of truth for upgradeReadiness.
+
+#### Cons
+
+- Increases the number of APIs that OLM introduces. OLM has already received negative feedback on the number of APIs one must work with.
+- Behaves differently from other standard Kubernetes probes.
+
+### Operators write to the status of the CSV/operator API
+
+OLM could allow operator authors to write to the status object of an API that OLM manages.
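+
+Purely as an illustration, an operator running the migration from the Motivation example might record its readiness as a condition on an OLM-managed resource. The `Upgradeable` condition type and every field below are assumptions made for this sketch and do not reflect the current CSV status schema:
+
+```yaml
+# Hypothetical status stanza written by the operator, not an existing OLM API
+status:
+  conditions:
+  - type: Upgradeable            # Assumed condition type
+    status: "False"              # The operator is asking OLM not to upgrade it
+    reason: MigrationInProgress  # Assumed reason string
+    message: Migrating v1 resources to the v2 schema
+```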
+
+#### Pros
+
+- No new APIs
+- Should require less OLM code
+
+#### Cons
+
+- Places the onus on operator authors to learn OLM-specific readiness reporting conventions.
+- Requires Operator Authors to write to the status of an OLM owned resource.
+- OLM would probably need to provide operator authors with a library to report readiness to standardize the approach.
+- Operators could place the operator's status in a bad state.