|
| 1 | += ADR034: Foundation for conversion webhooks deployment |
| 2 | + |
| 3 | +v0.1 |
| 4 | +:status: accepted |
| 5 | +:date: 2024-01-09 |
| 6 | + |
| 7 | +* Status: {status} |
| 8 | +* Deciders: |
| 9 | +** Sebastian Bernauer |
| 10 | +** Andrew Kenworthy |
| 11 | +** Sascha Lautenschlaeger |
| 12 | +** Razvan Mihai |
| 13 | +** Natalie Röijezon |
| 14 | +** Malte Sander |
| 15 | +* Date: {date} |
| 16 | + |
| 17 | +Technical Story: https://github.com/stackabletech/issues/issues/361 |
| 18 | + |
| 19 | +== Context |
| 20 | + |
| 21 | +We must version our CustomResourceDefinitions (CRDs). |
| 22 | +This step allows us to move away from unstable alpha or beta versions (like `v1alpha1`) to stable versions like `v1` or `v2`. |
| 23 | +These versions provide stable interfaces which customers can rely on. |
| 24 | +Since we cannot avoid having breaking changes in the future (which require a bump in the respective CRD version), we have to supply conversion webhooks that take care of converting older versions to the current storage version. |
| 25 | + |
| 26 | +Converting custom resources between versions is a separate step, independent of webhook deployments. |
| 27 | +CRD versions should be seamlessly upgraded when operators/webhooks are upgraded. Downgrades are possible by first converting the custom resources to the old version and then downgrading the operator and webhook. |
| 28 | + |
| 29 | +A conversion webhook is registered in a CRD like this: |
| 30 | + |
| 31 | +[source,yaml] |
| 32 | +---- |
| 33 | +spec: |
| 34 | +  conversion: |
| 35 | +    strategy: Webhook |
| 36 | +    webhook: |
| 37 | +      conversionReviewVersions: ["v1"] |
| 38 | +      clientConfig: |
| 39 | +        service: |
| 40 | +          namespace: default |
| 41 | +          name: example-conversion-webhook-server |
| 42 | +          path: /crdconvert |
| 43 | +        caBundle: "Ci0tLS0tQk...<base64-encoded PEM bundle>...tLS0K" |
| 44 | +---- |
| 45 | + |
| 46 | +This ADR is about the location of the webhook endpoint / server which the `spec.conversion.webhook.clientConfig.service` block is referencing. |
| 47 | + |
| 48 | + |
| 49 | +=== Use case: CRD downgrades |
| 50 | + |
| 51 | +There can be multiple CRD versions for an operator. Exactly one of them is the storage version, while multiple versions can be served at the same time. |
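|  | + |
|  | +As a minimal sketch of how this looks in a CRD (the version names `v1` and `v2` mirror the setting below, and the schemas are omitted for brevity), each entry in `spec.versions` carries a `served` flag, and exactly one entry sets `storage: true`: |
|  | + |
|  | +[source,yaml] |
|  | +---- |
|  | +spec: |
|  | +  versions: |
|  | +    - name: v1 |
|  | +      served: true    # still available through the API |
|  | +      storage: false  # converted to and from the storage version by the webhook |
|  | +      # schema omitted in this sketch |
|  | +    - name: v2 |
|  | +      served: true |
|  | +      storage: true   # the single version that is persisted |
|  | +      # schema omitted in this sketch |
|  | +---- |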
| 52 | + |
| 53 | +Setting: |
| 54 | + |
| 55 | +* old CRD version "v1" |
| 56 | +* new CRD version "v2" |
| 57 | +* there is a cluster/stacklet running in version "v2" |
| 58 | + |
| 59 | +Downgrade procedure: |
| 60 | + |
| 61 | +* Step 1: request the cluster definition in "v1" and apply it again |
| 62 | +* Step 2: downgrade operator and webhook deployments |
| 63 | + |
| 64 | +[NOTE] |
| 65 | +==== |
| 66 | +This works because the cluster definition is downgraded before the webhook is downgraded. |
| 67 | +This means that the webhook and the operator can be deployed in lock-step. |
| 68 | +==== |
| 69 | + |
| 70 | +Proposal: we could implement step 1 as a convenience in stackablectl and/or document how to perform it with kubectl or the https://github.com/kubernetes-sigs/kube-storage-version-migrator[storage migrator]. |
| 71 | + |
| 72 | +== Problem Statement |
| 73 | + |
| 74 | +There are several options on how or where to deploy a conversion webhook, e.g. coupled closely with the operator as a controller or completely decoupled via an extra deployment. |
| 75 | + |
| 76 | +We need a uniform deployment across all operators to keep implementation and maintenance to a minimum and reuse code wherever possible. |
| 77 | +Additionally, it should be possible to enable / disable webhooks on demand via options like Helm parameters, operator parameters or CRD flags. |
| 78 | + |
| 79 | +Furthermore, in terms of downgrading, webhooks should always be deployed in their "latest" version, meaning they can convert all supported (new) versions. |
| 80 | + |
| 81 | +== Discussion questions |
| 82 | + |
| 83 | +- Do we want this to be HA? |
| 84 | +- Do we want this to be deployed in a decoupled way? |
| 85 | +- One operator per Kubernetes cluster: What if three operators are deployed, each watching different namespaces / versions? This should be strongly discouraged! |
| 86 | +- How to abstract a common admission/conversion webhook skeleton in operator-rs, that can be implemented in the operators within a few lines of code (excluding the actual conversion code)? |
| 87 | +- How to keep maintenance, updating, pipelines or extra images to a minimum? |
| 88 | +- How to deactivate or not deploy the conversion webhook if not desired by customers? Or how to activate if opt-in? |
| 89 | + |
| 90 | +== Decision Drivers |
| 91 | + |
| 92 | +* Keep pipelining / maintenance / extra images / code to a minimum |
| 93 | +* Operator and webhook are deployed in lock-step |
| 94 | +* Must be deployable with Operator Lifecycle Manager (OLM) |
| 95 | +** OLM deploys webhooks together with operators in the same ClusterServiceVersion (CSV). This means webhooks and operators are NOT independently upgradable or downgradable. Also see the <<olm-notes>>. |
| 96 | +** Helm charts and OLM bundles should not diverge in functionality. This is to reduce maintenance costs. |
| 97 | +* The webhook has to keep working if the operator crashes |
| 98 | + |
| 99 | +[[olm-notes]] |
| 100 | +=== OLM Notes |
| 101 | + |
| 102 | +OLM is a Kubernetes operator that manages the lifecycle of other operators. |
| 103 | +It is used to install, update, and remove operators and their associated services. |
| 104 | +OLM uses a custom resource called a ClusterServiceVersion (CSV) to manage the lifecycle of an operator. |
| 105 | +A CSV is a manifest that describes the operator and its associated services. |
| 106 | +It contains metadata about the operator, such as its name, version, and supported Kubernetes versions. |
| 107 | +It also contains a list of resources that the operator manages, such as custom resource definitions (CRDs), roles, role bindings and most relevant for this ADR, webhook deployments. |
| 108 | + |
| 109 | +Webhooks managed by OLM are deployed together with the operator in the same ClusterServiceVersion (CSV) but as a separate Deployment. |
| 110 | +The webhook and the operator manage the same CustomResourceDefinitions marked as `owned` in the CSV. |
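|  | + |
|  | +As a rough sketch of what this could look like in a CSV, the following fragment declares a conversion webhook in the `webhookdefinitions` section. |
|  | +All names, ports and the CRD name are hypothetical placeholders and are not taken from an actual Stackable bundle; consult the OLM documentation linked at the end of this section for the authoritative field list. |
|  | + |
|  | +[source,yaml] |
|  | +---- |
|  | +spec: |
|  | +  webhookdefinitions: |
|  | +    - type: ConversionWebhook |
|  | +      generateName: conversion.example.com        # hypothetical webhook name |
|  | +      deploymentName: example-operator-webhook    # hypothetical Deployment listed in the same CSV |
|  | +      admissionReviewVersions: ["v1"] |
|  | +      sideEffects: None |
|  | +      containerPort: 443                          # illustrative ports |
|  | +      targetPort: 8443 |
|  | +      webhookPath: /crdconvert |
|  | +      conversionCRDs: |
|  | +        - exampleclusters.example.com             # hypothetical CRD owned by the CSV |
|  | +---- |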
| 111 | + |
| 112 | +Any CSV that contains conversion webhooks must support the `AllNamespaces` install mode. |
| 113 | +This is because CRDs are cluster-scoped, so their conversion webhook has to handle custom resources in all namespaces. |
| 114 | + |
| 115 | +The |
| 116 | + |
| 117 | +- `spec.conversion.webhook.clientConfig.service.namespace` and |
| 118 | +- `spec.conversion.webhook.clientConfig.service.name` |
| 119 | + |
| 120 | +fields of the CRD are required. |
| 121 | +For OLM, this means that the webhook must be deployed in that namespace together with the operator. |
| 122 | +This is a limitation of OLM and is not something that can be changed. |
| 123 | + |
| 124 | +For more details regarding OLM constraints for webhooks, see the OpenShift Container Platform https://docs.openshift.com/container-platform/4.14/operators/operator_sdk/osdk-generating-csvs.html#olm-webhook-considerations_osdk-generating-csvs[documentation]. |
| 125 | + |
| 126 | +== Considered Options |
| 127 | + |
| 128 | +[[option1]] |
| 129 | +=== Option 1: Deploy within the Operator as Controller |
| 130 | + |
| 131 | +The operator contains another controller in a separate thread with the webhook server and conversion code. |
| 132 | + |
| 133 | +==== Pros |
| 134 | + |
| 135 | +- No extra bin / main file |
| 136 | +- No extra Docker image (OpenShift certification) |
| 137 | +- No extra pipelines for the build process |
| 138 | +- Always up to date with the operator, no extra versioning |
| 139 | + |
| 140 | +==== Cons |
| 141 | + |
| 142 | +- Downgrade not possible -> older operators may not know new storage versions |
| 143 | +- Operator crash affects webhook, no custom resources can be applied during that time |
| 144 | + -> prevents writes; reads only work for the current storage version |
| 145 | +- Updating webhook requires updating the whole operator |
| 146 | +- (OpenShift restrictions? Restricted namespaces etc.?) |
| 147 | + |
| 148 | +[[option2]] |
| 149 | +=== Option 2: Deploy within the Operator as Extra Container with Operator Image |
| 150 | + |
| 151 | +The operator deployment contains another container next to the actual operator containing the webhook server and conversion code using the operator docker image. |
| 152 | + |
| 153 | +==== Pros |
| 154 | + |
| 155 | +- No extra pipelines for the build process |
| 156 | +- Could be enabled / disabled using Helm parameters |
| 157 | +- Operator crash does not affect webhook |
| 158 | +- Always up to date with the operator, no extra versioning |
| 159 | + |
| 160 | +==== Cons |
| 161 | + |
| 162 | +- Downgrade not possible -> older operators may not know new storage versions |
| 163 | +- Overhead due to operator image (not just the lightweight webhook server) |
| 164 | +- Updating webhook requires updating the whole operator |
| 165 | +- (Extra bin / main file) |
| 166 | +- (OpenShift restrictions? Restricted namespaces etc.?) |
| 167 | + |
| 168 | +[[option3]] |
| 169 | +=== Option 3: Deploy within the Operator as Extra Container and Extra Image |
| 170 | + |
| 171 | +The operator deployment contains another container next to the actual operator containing the webhook server and conversion code using its own docker image. |
| 172 | + |
| 173 | +==== Pros |
| 174 | + |
| 175 | +- No overhead due to operator image (just the lightweight webhook server) |
| 176 | +- Operator crash does not affect webhook |
| 177 | +- Could be enabled / disabled using Helm parameters |
| 178 | +- Always up to date with the operator, no extra versioning |
| 179 | + |
| 180 | +==== Cons |
| 181 | + |
| 182 | +- Downgrade not possible -> older operators may not know new storage versions |
| 183 | +- Updating webhook requires updating the whole operator |
| 184 | +- Extra pipelines / images for the build process |
| 185 | +- (OpenShift restrictions? Restricted namespaces etc.?) |
| 186 | + |
| 187 | +[[option4]] |
| 188 | +=== Option 4: The Operator creates a Webhook Deployment |
| 189 | + |
| 190 | +The operator deploys a webhook Deployment similar to how it deploys e.g. StatefulSets. |
| 191 | + |
| 192 | +==== Pros |
| 193 | + |
| 194 | +- Operator crash does not affect webhook |
| 195 | +- Could be enabled / disabled via custom resource |
| 196 | +- Always up to date with the operator, no extra versioning |
| 197 | +- Should not interfere with OpenShift |
| 198 | + |
| 199 | +==== Cons |
| 200 | + |
| 201 | +- Downgrade not possible -> older operators may not know new storage versions |
| 202 | +- Updating webhook requires updating the whole operator (bundle) |
| 203 | +- Possibly extra image |
| 204 | +- Possibly extra pipelines |
| 205 | +- Possibly more complex to test |
| 206 | + |
| 207 | +[[option5]] |
| 208 | +=== Option 5: The Webhook has its own Deployment |
| 209 | + |
| 210 | +The webhook and the operator are deployed in lock-step, each in its own Deployment. |
| 211 | +Both deployments are part of the same Helm Chart, OLM CSV, etc. |
| 212 | +High availability of the webhook is achieved with multiple Deployment replicas. |
| 213 | +Both are bundled in the same container image. |
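|  | + |
|  | +A minimal sketch of such a webhook Deployment is shown below. |
|  | +The resource name, labels, image reference, `webhook` argument and port are all hypothetical and only illustrate the idea of running several replicas from the same image as the operator: |
|  | + |
|  | +[source,yaml] |
|  | +---- |
|  | +apiVersion: apps/v1 |
|  | +kind: Deployment |
|  | +metadata: |
|  | +  name: example-operator-webhook          # hypothetical name |
|  | +spec: |
|  | +  replicas: 2                             # more than one replica for high availability |
|  | +  selector: |
|  | +    matchLabels: |
|  | +      app.kubernetes.io/name: example-operator-webhook |
|  | +  template: |
|  | +    metadata: |
|  | +      labels: |
|  | +        app.kubernetes.io/name: example-operator-webhook |
|  | +    spec: |
|  | +      containers: |
|  | +        - name: webhook |
|  | +          image: example.org/example-operator:0.0.0-dev  # placeholder, same image as the operator Deployment |
|  | +          args: ["webhook"]               # hypothetical argument selecting the webhook server |
|  | +          ports: |
|  | +            - containerPort: 8443         # illustrative port |
|  | +---- |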
| 214 | + |
| 215 | +==== Pros |
| 216 | + |
| 217 | +- Operator crash does not affect webhook |
| 218 | +- Downgrade possible -> can adapt to new CRD storage versions |
| 219 | +- Could be enabled / disabled using Helm parameters |
| 220 | +- The webhook can be updated independently |
| 221 | +- No extra pipelines / images |
| 222 | + |
| 223 | +==== Cons |
| 224 | + |
| 225 | +- In OLM environments, if the operator fails to deploy, the webhook is also not deployed. |
| 226 | + |
| 227 | +== Decision Outcome |
| 228 | + |
| 229 | +Chosen <<option5>>, because it satisfies all decision drivers. |
| 230 | + |
| 231 | +== Links |
| 232 | + |
| 233 | +- ADR https://docs.stackable.tech/home/nightly/contributor/adr/adr034-foundation-webhooks-ca-bundle.adoc[CA bundle injection] |
| 234 | +- https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definition-versioning/[Kubernetes CRD versioning] |