Commit 0015571

Upgrade Readiness Enhancement
1 parent 717d321 commit 0015571

1 file changed

+291
-0
lines changed

enhancements/upgrade-readiness.md

Lines changed: 291 additions & 0 deletions
@@ -0,0 +1,291 @@
---
title: operator-status
authors:
  - "@awgreene"
reviewers:
  - "@ecordell"
approvers:
  - "@kevinrizza"
creation-date: 2020-05-19
last-updated: yyyy-mm-dd
status: implementable
see-also:
  - "N/A"
replaces:
  - "N/A"
superseded-by:
  - "N/A"
---
19+
20+
# operator-status
21+
22+
## Release Signoff Checklist
23+
24+
- [x] Enhancement is `implementable`
25+
- [ ] Design details are appropriately documented from clear requirements
26+
- [ ] Test plan is defined
27+
- [ ] Graduation criteria for dev preview, tech preview, GA
28+
29+
## Summary
30+
31+
Managing the upgrade of an operator is a core feature of the [Operator Lifecycle Manager (OLM) project](https://github.com/operator-framework/operator-lifecycle-manager). There are instances where an operator is performing a long-running critical task during which it should not be interrupted by an upgrade initiated by OLM. OLM must provide operators with the means to communicate whether or not they are ready to be upgraded.
32+
33+
## Motivation
34+
35+
Today, OLM does not provide operators with the ability to block upgrades while the operator completes a critical process. Consider the following example:
36+
37+
An operator has three versions available, v1, v2, and v3, where v2 has a long-running process that migrates resources in v1 to the new schema in v2. As of today, if operator v1 was installed with automatic updates enabled, and v2 and v3 became available at the same time, OLM would upgrade the operator from v1 to v2 to v3 without allowing v2's migration process to complete.
38+
39+
The scenario above could place the operator in an unrecoverable state. OLM must provide the operators that it manages with a method to communicate whether or not they are ready to be upgraded.
40+
41+
### Goals
42+
43+
- Provide operators with the means to communicate whether or not they should be upgraded.
44+
- Update OLM so operators are not upgraded if they communicate that they should not be upgraded.
45+
- Preserve existing behavior should the operator not communicate its upgradeable status.
46+
47+
### Non-Goals
48+
49+
- Support operator version rollback.
50+
- Provide operators with the means to cancel an upgrade that is in progress.
51+
52+
## Proposal
53+
54+
### User Stories
55+
56+
#### Story 1
57+
58+
As an operator author, I want my operator to be able to communicate that OLM should not attempt to upgrade my operator until I remove the flag.
59+
60+
### Implementation Details/Notes/Constraints [optional]
61+
62+
Operators can communicate upgrade readiness by exposing an `upgradeReadiness` endpoint. This endpoint can communicate readiness by:
63+
64+
- returning 0, indicating that the operator cannot be upgraded at this time.
65+
- returning 1, indicating that the operator is ready to be upgraded.
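
The 0/1 contract above could be served by the operator with a very small HTTP handler. The following is a minimal sketch, not part of the proposal: the `readiness` type, the `probe` helper, and the `/upgradeReadiness` path are hypothetical names chosen for illustration, and the handler is exercised in-process rather than on a real port.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// readiness tracks whether the operator may be upgraded right now.
// The zero value reports "not ready". (Hypothetical type for illustration.)
type readiness struct {
	upgradeable atomic.Bool
}

// ServeHTTP writes "1" when the operator is ready to be upgraded and "0" otherwise.
func (r *readiness) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	if r.upgradeable.Load() {
		fmt.Fprint(w, "1")
		return
	}
	fmt.Fprint(w, "0")
}

// probe calls the handler in-process and returns the response body.
// The request path is an assumption; the proposal does not fix one.
func probe(h http.Handler) string {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/upgradeReadiness", nil))
	return rec.Body.String()
}

func main() {
	r := &readiness{}
	fmt.Println("during migration:", probe(r))

	// The operator flips the flag once its critical task has finished.
	r.upgradeable.Store(true)
	fmt.Println("after migration:", probe(r))

	// A real operator would instead serve the handler on the port that the
	// Service targets, for example: http.ListenAndServe(":9376", r)
}
```

Defaulting to "not ready" is a design choice in this sketch: an operator that has not yet decided its state blocks upgrades rather than allowing one by accident.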
66+
67+
This endpoint will be exposed via a [Service resource](https://kubernetes.io/docs/concepts/services-networking/service/#service-resource) that OLM will probe for upgrade readiness prior to upgrading the operator to the next version.
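
On the OLM side, probing the Service could amount to one HTTP GET followed by a strict parse of the body. The sketch below assumes the endpoint is reachable over plain HTTP and that any body other than `0` or `1` is an error; `probeUpgradeReadiness` and `probeFake` are hypothetical names, and a local test server stands in for the operator's Service.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// probeUpgradeReadiness fetches url and interprets the body:
// "1" means ready to upgrade, "0" means not ready.
// Anything else is returned as an error so the caller can decide how to treat it.
func probeUpgradeReadiness(client *http.Client, url string) (bool, error) {
	resp, err := client.Get(url)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return false, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	// The expected body is a single character, so cap how much is read.
	body, err := io.ReadAll(io.LimitReader(resp.Body, 16))
	if err != nil {
		return false, err
	}
	switch strings.TrimSpace(string(body)) {
	case "1":
		return true, nil
	case "0":
		return false, nil
	default:
		return false, fmt.Errorf("unexpected readiness response %q", body)
	}
}

// probeFake starts a throwaway local server that answers with value,
// probes it once, and shuts it down again.
func probeFake(value string) (bool, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, value)
	}))
	defer srv.Close()
	client := &http.Client{Timeout: 2 * time.Second}
	return probeUpgradeReadiness(client, srv.URL)
}

func main() {
	for _, v := range []string{"0", "1", "maybe"} {
		ok, err := probeFake(v)
		fmt.Printf("body=%q upgradeable=%v err=%v\n", v, ok, err)
	}
}
```

How OLM should treat a probe error (block, retry, or fall back to existing behavior) is left open here, as the proposal does not specify it.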
68+
69+
This feature must be supported by both CSV and CSVless bundles.
70+
71+
#### CSV Bundles
72+
73+
OLM's [ClusterServiceVersion](https://github.com/operator-framework/api/blob/197407cd70e8ddfef85d21216085ed52fbb4bb2d/pkg/operators/v1alpha1/clusterserviceversion_types.go#L500) type will be updated to include an `upgradeReadinessServices` field for each [StrategyDeploymentSpec](https://github.com/operator-framework/api/blob/197407cd70e8ddfef85d21216085ed52fbb4bb2d/pkg/operators/v1alpha1/clusterserviceversion_types.go#L63).
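
One possible shape for the new field is sketched below. These are simplified, hypothetical stand-ins and not the real types from `operator-framework/api` or `k8s.io/api`: the actual `StrategyDeploymentSpec` carries more fields, and a real Service port uses an int-or-string `targetPort`. A nil `Spec` models a service that is only named in the CSV.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ServicePort and ServiceSpec are trimmed stand-ins for the core/v1 types.
type ServicePort struct {
	Protocol   string `json:"protocol,omitempty"`
	Port       int32  `json:"port"`
	TargetPort int32  `json:"targetPort,omitempty"`
}

type ServiceSpec struct {
	Selector map[string]string `json:"selector,omitempty"`
	Ports    []ServicePort     `json:"ports,omitempty"`
}

// UpgradeReadinessService is a hypothetical type: a Service name plus an
// optional inline spec. A nil Spec means the Service is defined elsewhere.
type UpgradeReadinessService struct {
	Name string       `json:"name"`
	Spec *ServiceSpec `json:"spec,omitempty"`
}

// StrategyDeploymentSpec is a trimmed stand-in showing only the fields
// relevant to this sketch.
type StrategyDeploymentSpec struct {
	Name                     string                    `json:"name"`
	UpgradeReadinessServices []UpgradeReadinessService `json:"upgradeReadinessServices,omitempty"`
}

// parseDeployment decodes one deployment entry from its JSON form.
func parseDeployment(raw string) (StrategyDeploymentSpec, error) {
	var d StrategyDeploymentSpec
	err := json.Unmarshal([]byte(raw), &d)
	return d, err
}

func main() {
	raw := `{
	  "name": "foo-operator",
	  "upgradeReadinessServices": [
	    {"name": "foo-operator-readiness-service",
	     "spec": {"selector": {"app": "foo-operator"},
	              "ports": [{"protocol": "TCP", "port": 80, "targetPort": 9376}]}}
	  ]
	}`
	d, err := parseDeployment(raw)
	if err != nil {
		panic(err)
	}
	for _, s := range d.UpgradeReadinessServices {
		fmt.Println(s.Name, "has inline spec:", s.Spec != nil)
	}
}
```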
74+
75+
There are two options for defining an Upgradeable Readiness Service in the CSV.
76+
77+
##### Approach 1: Define the service
78+
79+
As a developer, you may wish to include the readiness service in the CSV for rapid iteration. This can be accomplished by setting the `spec` of each `upgradeReadinessServices` entry in the CSV.
80+
81+
```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: foo-operator.v1.0.0
  namespace: olm
spec:
  description: |
    An example operator
  displayName: foo-operator
  install:
    spec:
      deployments:
      - name: foo-operator
        upgradeReadinessServices: # An array of readiness services
        - name: foo-operator-readiness-service # The Service name
          spec: # Standard Service spec
            selector:
              app: foo-operator
            ports:
            - protocol: TCP
              port: 80
              targetPort: 9376
        spec:
          replicas: 1
          selector:
            matchLabels:
              app: foo-operator
          template:
            metadata:
              labels:
                app: foo-operator
            spec:
              containers:
              - name: foo-operator
                image: foo
                command:
                - sleep
                - "3600"
    strategy: deployment
  installModes:
  - supported: false
    type: OwnNamespace
  - supported: true
    type: SingleNamespace
  - supported: true
    type: MultiNamespace
  - supported: true
    type: AllNamespaces
  maturity: alpha
  provider:
    name: Red Hat
  version: 1.0.0
```
135+
##### Approach 2: Omitting the Upgradeable Readiness Service Spec
136+
137+
This approach allows operator authors to provide the name of an upgradeable readiness service that should exist on the cluster. When using this approach, the service will typically be present in the operator bundle.
138+
139+
First, you must create a CSV that omits the spec of the upgradeReadinessService:
140+
141+
```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: foo-operator.v1.0.0
  namespace: olm
spec:
  description: |
    An example operator
  displayName: foo-operator
  install:
    spec:
      deployments:
      - name: foo-operator
        upgradeReadinessServices: # An array of upgradeReadiness services
        - name: foo-operator-readiness-service # The name of the Service
        spec:
          replicas: 1
          selector:
            matchLabels:
              app: foo-operator
          template:
            metadata:
              labels:
                app: foo-operator
            spec:
              containers:
              - name: foo-operator
                image: foo
                command:
                - sleep
                - "3600"
    strategy: deployment
  installModes:
  - supported: false
    type: OwnNamespace
  - supported: true
    type: SingleNamespace
  - supported: true
    type: MultiNamespace
  - supported: true
    type: AllNamespaces
  maturity: alpha
  provider:
    name: Red Hat
  version: 1.0.0
```
188+
189+
The operator bundle should include the following Service:
190+
```yaml
apiVersion: v1
kind: Service
metadata:
  name: foo-operator-readiness-service
spec:
  selector:
    app: foo-operator
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376
```
203+
204+
#### CSVless Bundles
205+
206+
In the case of a CSVless bundle, an annotation can be added to `annotations.yaml` that indicates the name of the upgradeReadinessService and the deployment name:
207+
208+
```yaml
operators.operatorframework.io.deployment.<DeploymentName>.readiness.service: "<ServiceName>"
```
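
To make the key format concrete, the following sketch extracts deployment-to-service pairs from a bundle's annotations. It assumes the key layout shown above; the `readinessServices` function and the constant names are hypothetical, and how OLM would actually read these annotations is not defined by this proposal.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	// Fixed parts of the annotation key that surround <DeploymentName>.
	annotationPrefix = "operators.operatorframework.io.deployment."
	annotationSuffix = ".readiness.service"
)

// readinessServices returns a map from deployment name to readiness
// Service name, built from every annotation that matches the key format.
// Unrelated annotations and entries with empty names are skipped.
func readinessServices(annotations map[string]string) map[string]string {
	out := map[string]string{}
	for key, service := range annotations {
		if !strings.HasPrefix(key, annotationPrefix) {
			continue
		}
		rest := strings.TrimPrefix(key, annotationPrefix)
		if !strings.HasSuffix(rest, annotationSuffix) {
			continue
		}
		deployment := strings.TrimSuffix(rest, annotationSuffix)
		if deployment == "" || service == "" {
			continue
		}
		out[deployment] = service
	}
	return out
}

func main() {
	annotations := map[string]string{
		"operators.operatorframework.io.bundle.package.v1":                         "foo",
		"operators.operatorframework.io.deployment.foo-operator.readiness.service": "foo-operator-readiness-service",
	}
	for deployment, service := range readinessServices(annotations) {
		fmt.Printf("deployment %q is gated by service %q\n", deployment, service)
	}
}
```

Trimming the fixed prefix and suffix, rather than splitting on dots, keeps the sketch correct for deployment names that themselves contain dots.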
211+
212+
Once the `annotations.yaml` is created with the key/value above, the following resources should appear in the operator bundle:
213+
214+
- A service whose name matches the `ServiceName` in your annotation and that routes to the readiness endpoint.
- A deployment that:
  - Has a name matching the `DeploymentName` used in your annotation.
  - Has a pod template with a label that is selected by the service's selector.
218+
219+
### Risks and Mitigations
220+
221+
- Not all operators may take advantage of this feature. If an operator does not opt in to this feature, OLM will fall back on existing behavior when checking if an operator can be upgraded.
222+
223+
- An operator may opt in to this feature but never report itself as upgradeable. In these situations, cluster admins must uninstall the operator and upgrade it manually.
224+
225+
- If OLM reports metrics to Telemeter regarding whether or not an operator is upgradeable, it is reporting an unbounded metric, which may overwhelm Telemeter.
226+
227+
## Design Details
228+
229+
### Test Plan
230+
231+
OLM's e2e testing suite will be extended to cover the following use cases:
232+
233+
- If a CSV defines an upgradeReadiness Service without a spec and that Service does not exist, upgrades for that operator are blocked.
234+
- If a CSVless bundle defines an upgradeReadiness Service that does not exist, the bundle fails to build.
235+
- If a CSV defines an upgradeReadiness Service with a spec, changes to the service are reverted by OLM.
236+
- If a CSV/CSVless bundle defines an upgradeReadiness Service that returns 0, upgrades to the operator are blocked. The operator is then upgraded once the Service no longer returns 0.
237+
238+
Additionally, tests that use CSVs that omit the upgradeReadiness Service must continue to adhere to existing behavior.
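
The gating behavior these tests exercise can be summarized as a small decision function. This is a sketch under one stated assumption that the proposal does not spell out: when several readiness services are declared, all of them must report ready. The `probeResult` type and `upgradeAllowed` function are hypothetical names.

```go
package main

import "fmt"

// probeResult is the outcome of checking one declared readiness Service.
type probeResult int

const (
	serviceMissing probeResult = iota // the named Service does not exist on the cluster
	notReady                          // the Service returned 0
	ready                             // the Service returned 1
)

// upgradeAllowed reports whether an upgrade may proceed.
// With no declared services the result is true, which preserves existing
// behavior for operators that do not opt in. Otherwise every declared
// service must report ready (an assumption of this sketch).
func upgradeAllowed(results []probeResult) bool {
	for _, r := range results {
		if r != ready {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println("no readiness service declared:", upgradeAllowed(nil))
	fmt.Println("service missing:", upgradeAllowed([]probeResult{serviceMissing}))
	fmt.Println("service returns 0:", upgradeAllowed([]probeResult{notReady}))
	fmt.Println("service returns 1:", upgradeAllowed([]probeResult{ready}))
}
```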
239+
240+
241+
## Implementation History
242+
243+
Major milestones in the life cycle of a proposal should be tracked in `Implementation History`.
245+
246+
## Alternatives
247+
248+
There are viable alternatives to the solution proposed in this enhancement.
249+
250+
### Substitute the UpgradeReadinessService with an UpdateProbe
251+
252+
It is possible that we could take advantage of the standard defined by [Liveness, Readiness, and Startup Probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/) by importing the [Prober](https://github.com/kubernetes/kubernetes/blob/release-1.18/pkg/kubelet/prober/prober.go#L48) code. The pros and cons of this approach are listed below:
253+
254+
**This seems to be the most viable alternative.**
255+
256+
#### Pros
257+
258+
- Follows a well-defined probe format which supports [HTTPGet, Exec, and TCPSocket](https://github.com/kubernetes/api/blob/6f652b6ce59c386f4a431eb031d0339620aaff5e/core/v1/types.go#L2279-L2292) options.
259+
- Enables OLM to identify which pod is reporting an upgradeReadiness failure, as a service could route to one of any number of pods.
260+
261+
#### Cons
262+
- Requires OLM to import and maintain the prober code that executes the probe.
263+
264+
### Create an API so Operators may report Status
265+
266+
OLM could introduce a new API that allows operators to report their upgradeReadiness.
267+
268+
#### Pros
269+
270+
- Provides a single source of truth for upgradeReadiness.
271+
272+
#### Cons
273+
274+
- Increases the number of APIs that OLM introduces. OLM has already received negative feedback on the number of APIs one must work with.
275+
- Behaves differently from other standard Kubernetes probes.
276+
277+
### Operators write to the status of the CSV/operator API
278+
279+
OLM could allow operator authors to write to the status object of an API that OLM manages.
280+
281+
#### Pros
282+
283+
- No new APIs
284+
- Should require less OLM code
285+
286+
#### Cons
287+
288+
- Places the onus on Operator Authors to learn OLM-specific readiness reporting conventions.
289+
- Requires Operator Authors to write to the status of an OLM-owned resource.
290+
- OLM would probably need to provide operator authors with a library to report readiness to standardize the approach.
291+
- Operators could leave the status of the OLM-owned resource in a bad state.
