diff --git a/keps/prod-readiness/sig-node/2033.yaml b/keps/prod-readiness/sig-node/2033.yaml
index cd3f75ecd59..4acbd66ebaa 100644
--- a/keps/prod-readiness/sig-node/2033.yaml
+++ b/keps/prod-readiness/sig-node/2033.yaml
@@ -1,3 +1,5 @@
 kep-number: 2033
 alpha:
   approver: "@ehashman"
+beta:
+  approver: "@soltysh"
diff --git a/keps/sig-node/2033-kubelet-in-userns-aka-rootless/README.md b/keps/sig-node/2033-kubelet-in-userns-aka-rootless/README.md
index 94c802a5400..b76b31d21df 100644
--- a/keps/sig-node/2033-kubelet-in-userns-aka-rootless/README.md
+++ b/keps/sig-node/2033-kubelet-in-userns-aka-rootless/README.md
@@ -1,4 +1,11 @@
+
-### User Stories
+### User Stories (Optional)
-Tests are present in several subproject repos and third party repos:
-- https://github.com/kubernetes-sigs/kind/blob/v0.17.0/.github/workflows/cgroup2.yaml#L24
-- https://github.com/kubernetes/minikube/blob/v1.29.0/.github/workflows/pr.yml#L293-L410
-- https://github.com/k3s-io/k3s/blob/v1.26.1+k3s1/.github/workflows/cgroup.yaml#L92-L99
-- https://github.com/rootless-containers/usernetes/blob/v20221007.0/.cirrus.yml
+[X] I/we understand the owners of the involved components may require updates to
+existing tests to make this code solid enough prior to committing the changes necessary
+to implement this enhancement.
+
+See [e2e tests](#e2e-tests) below.
+
+Additional tests are present in several subproject repos and third party repos:
+- https://github.com/kubernetes-sigs/kind/blob/v0.29.0/.github/workflows/vm.yaml#L24
+- https://github.com/kubernetes/minikube/blob/v1.36.0/.github/workflows/pr.yml#L299-L415
+- https://github.com/k3s-io/k3s/blob/v1.33.1%2Bk3s1/.github/workflows/e2e.yaml#L56
+- https://github.com/rootless-containers/usernetes/blob/gen2-v20250501.0/.github/workflows/main.yaml
+  - Covers multi-node clusters with Flannel (VXLAN)
+  - Covers several host distributions (Ubuntu, CentOS Stream, and Fedora)
+
+##### Prerequisite testing updates
+
+
+
+##### Unit tests
+
+
+
+
+
+N/A, as unit tests do not make sense here.
-Tests will be added to `kubernetes/test-infra` as well when the [`k8s-infra-prow-build`](https://github.com/kubernetes/k8s.io/blob/a071c4ed0823f193ee29e2f14e191be42dc1a1f0/infra/gcp/terraform/k8s-infra-prow-build/main.tf#L78) cluster
-is upgraded to use cgroup v2.
-This will probably automatically happen when [GKE bumps up their "regular" channel to Kubernetes v1.26 or later](https://cloud.google.com/kubernetes-engine/docs/how-to/node-system-config).
+##### Integration tests
+
+
+
+
+
+N/A, as integration tests do not make sense here.
+
+##### e2e tests
+
+
+
+`NodeConformance` tests are executed using [kubetest2-kindinv](https://github.com/rootless-containers/kubetest2-kindinv).
+
+"kindinv" stands for "Kubernetes in (Rootless) Docker in (GCE) VM".
+GCE VM is used for enabling systemd that is required by Rootless Docker to set up cgroup v2.
+
+```bash
+exec kubetest2 kindinv \
+  --boskos-location=http://boskos.test-pods.svc.cluster.local \
+  --gcp-zone=us-central1-b \
+  --instance-image=ubuntu-os-cloud/ubuntu-2204-lts \
+  --instance-type=n2-standard-4 \
+  --kind-rootless \
+  --user=rootless \
+  --build \
+  --up \
+  --down \
+  --test=ginkgo \
+  -- \
+  --focus-regex='\[NodeConformance\]' \
+  --skip-regex='\[Environment:NotInUserNS\]|\[Slow\]' \
+  --parallel=8
+```
+
+- Prow manifest: https://github.com/kubernetes/test-infra/blob/4b7824ff1cfe00c36062035ab6aea3bb6c2e6ba2/config/jobs/kubernetes/sig-testing/kubernetes-kind.yaml#L615-L678
+- Logs: https://prow.k8s.io/job-history/gs/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kind-rootless
 ### Graduation Criteria
@@ -516,12 +632,13 @@ This will probably automatically happen when [GKE bumps up their "regular" chann
 Define graduation milestones.
-These may be defined in terms of API maturity, or as something else. The KEP
-should keep this high-level with a focus on what signals will be looked at to
-determine graduation.
+These may be defined in terms of API maturity, [feature gate] graduations, or as
+something else. The KEP should keep this high-level with a focus on what
+signals will be looked at to determine graduation.
 Consider the following in developing the graduation criteria for this enhancement:
 - [Maturity levels (`alpha`, `beta`, `stable`)][maturity-levels]
+- [Feature gate][feature gate] lifecycle
 - [Deprecation policy][deprecation-policy]
 Clearly define what graduation means by either linking to the [API doc
@@ -531,39 +648,56 @@ or by redefining what graduation means.
 In general we try to use the same stages (alpha, beta, GA), regardless of how the functionality is accessed.
+[feature gate]: https://git.k8s.io/community/contributors/devel/sig-architecture/feature-gates.md
 [maturity-levels]: https://git.k8s.io/community/contributors/devel/sig-architecture/api_changes.md#alpha-beta-and-stable-versions
 [deprecation-policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
 Below are some examples to consider, in addition to the aforementioned [maturity levels][maturity-levels].
-#### Alpha -> Beta Graduation
+#### Alpha
+
+- Feature implemented behind a feature flag
+- Initial e2e tests completed and enabled
+
+#### Beta
 - Gather feedback from developers and surveys
 - Complete features A, B, C
-- Tests are in Testgrid and linked in KEP
+- Additional tests are in Testgrid and linked in KEP
+- More rigorous forms of testing—e.g., downgrade tests and scalability tests
+- All functionality completed
+- All security enforcement completed
+- All monitoring requirements completed
+- All testing requirements completed
+- All known pre-release issues and gaps resolved
-#### Beta -> GA Graduation
+**Note:** Beta criteria must include all functional, security, monitoring, and testing requirements along with resolving all issues and gaps identified
+
+#### GA
 - N examples of real-world usage
 - N installs
-- More rigorous forms of testing—e.g., downgrade tests and scalability tests
 - Allowing time for feedback
+- All issues and gaps identified as feedback during beta are resolved
+
+**Note:** GA criteria must not include any functional, security, monitoring, or testing requirements. Those must be beta requirements.
 **Note:** Generally we also wait at least two releases between beta and GA/stable, because there's no opportunity for user feedback, or even bug reports,
 in back-to-back releases.
-#### Removing a Deprecated Flag
+**For non-optional features moving to GA, the graduation criteria must include
+[conformance tests].**
+
+[conformance tests]: https://git.k8s.io/community/contributors/devel/sig-architecture/conformance-tests.md
+
+#### Deprecation
+
@@ -571,9 +705,7 @@ in back-to-back releases.
 - Beta: e2e tests coverage. Requires [the cgroup v2 KEP](../20191118-cgroups-v2.md ) to reach Beta or GA.
-  To move to beta, we need clarity if we intend to define two separate types of conformance suites:
-  - kubernetes clusters that can run privileged workloads
-  - kubernetes cluster that are restricted to run unprivileged workloads only
+  The tests are covered by `NodeConformance` tests (see above).
 - GA: Assuming no negative user feedback based on production experience, promote after >= 2 releases in beta. Requires [the cgroup v2 KEP](../20191118-cgroups-v2.md ) to reach GA.
@@ -602,14 +734,15 @@ components? What are the guarantees? Make sure this is in the test plan.
 Consider the following in developing a version skew strategy for this enhancement:
-- Does this enhancement involve coordinating behavior in the control plane and
-  in the kubelet? How does an n-2 kubelet without this feature available behave
-  when this feature is used?
+- Does this enhancement involve coordinating behavior in the control plane and nodes?
+- How does an n-3 kubelet or kube-proxy without this feature available behave when this feature is used?
+- How does an n-1 kube-controller-manager or kube-scheduler without this feature available behave when this feature is used?
 - Will any other components on the node change? For example, changes to CSI,
   CRI or CNI may require updating that component before the kubelet.
 -->
-N/A
+N/A.
+This KEP only affects the internal of kubelet, and does not affect any API.
 ## Production Readiness Review Questionnaire
@@ -619,11 +752,10 @@ Production readiness reviews are intended to ensure that features merging into
 Kubernetes are observable, scalable and supportable; can be safely operated in
 production environments, and can be disabled or rolled back in the event they
 cause increased failures in production. See more in the PRR KEP at
-https://git.k8s.io/enhancements/keps/sig-architecture/20190731-production-readiness-review-process.md.
+https://git.k8s.io/enhancements/keps/sig-architecture/1194-prod-readiness.
-The production readiness review questionnaire must be completed for features in
-v1.19 or later, but is non-blocking at this time. That is, approval is not
-required in order to be in the release.
+The production readiness review questionnaire must be completed and approved
+for the KEP to move to `implementable` status and be included in the release.
 In some cases, the questions below should also have answers in `kep.yaml`. This
 is to enable automation to verify the presence of the review, and to reduce review
@@ -634,17 +766,35 @@ The KEP must have a approver from the
 team. Please reach out on the
 [#prod-readiness](https://kubernetes.slack.com/archives/CPNHUMN74) channel if
 you need any help or guidance.
-
 -->
 ### Feature Enablement and Rollback
-_This section must be completed when targeting alpha to a release._
+
+
+###### How can this feature be enabled / disabled in a live cluster?
+
+
-* **How can this feature be enabled / disabled in a live cluster?**
-  - [X] Feature gate (also fill in values in `kep.yaml`): `KubeletInUserNamespace`
-  - [ ] Other
-    - Describe the mechanism:
+- [X] Feature gate (also fill in values in `kep.yaml`)
+  - Feature gate name: `KubeletInUserNamespace`
+  - Components depending on the feature gate: kubelet
+- [ ] Other
+  - Describe the mechanism:
+  - Will enabling / disabling the feature require downtime of the control
+    plane?
+  - Will enabling / disabling the feature require downtime or reprovisioning
+    of a node?
 Enabling `KubeletInUsernamespace` feature gate does not automatically execute kubelet in a user namespace.
 The user namespace has to be created by RootlessKit before running kubelet.
@@ -654,67 +804,211 @@ Note that this feature gate does not support separating kubelet's user namespace
 node components such as CRI.
 All the node components must run in the same user namespace.
-* **Does enabling the feature change any default behavior?**
+###### Does enabling the feature change any default behavior?
+
-During Alpha, we will document what workloads will work and what will not work.
+The limitation is same as Rootless Docker, Podman, etc.
+See .
-* **Can the feature be disabled once it has been enabled (i.e. can we roll back
-  the enablement)?**
+###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
-N/A, as switching back rootless to rootful requires redeploying the kubelet, and vice versa.
+
-* **Are there any tests for feature enablement/disablement?**
+Yes, by turning off the feature gate.
+
+###### What happens if we reenable the feature if it was previously rolled back?
+
+Nothing happens.
+
+###### Are there any tests for feature enablement/disablement?
+
+
-CI will run `kind` (Kubernetes in Docker) tests with Rootless Docker/Podman.
-Tests with a real cluster will be added later as well.
+Yes. See [Test Plan](#test-plan).
 ### Rollout, Upgrade and Rollback Planning
-_This section must be completed when targeting beta graduation to a release._
+
+
+###### How can a rollout or rollback fail? Can it impact already running workloads?
+
+
+
+Rollout: Rolling out requires recreating a new node instance, in a UserNS.
+Typical failures:
+- [subuids are not allocated](https://rootlesscontaine.rs/getting-started/common/subuid/)
+- [cgroup v2 delegation is not enabled](https://rootlesscontaine.rs/getting-started/common/cgroup2/)
+
+Rollback: this question is not applicable. Rolling back requires recreating a new node instance.
-This section will be fulfilled when targeting beta graduation to a release.
+###### What specific metrics should inform a rollback?
+
+
+
+CrashLoopBackOffs
+
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+
+
+
+This question is not applicable. Rolling out and rolling back requires recreating a new node instance.
+
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+
+
+
+No
 ### Monitoring Requirements
-_This section must be completed when targeting beta graduation to a release._
+
-* **How can an operator determine if the feature is in use by workloads?**
+###### How can an operator determine if the feature is in use by workloads?
-N/A
+
-* **What are the SLIs (Service Level Indicators) an operator can use to determine
-the health of the service?**
-  - [ ] Metrics
-    - Metric name:
-    - [Optional] Aggregation method:
-    - Components exposing the metric:
-  - [X] Other (treat as last resort)
-    - Details: Use `systemctl --user is-system-running` to verify whether the processes (RootlessKit, kubelet, kube-proxy, and CRI) are running.
+They can determine if a Pod is running on a node that is running in UserNS.
-* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**
+###### How can someone using this feature know that it is working for their instance?
-N/A
+
-* **Are there any missing metrics that would be useful to have to improve observability
-of this feature?**
+- [X] Events
+  - Event Reason: No CrashLoopBackOff
+- [ ] API .status
+  - Condition name:
+  - Other field:
+- [ ] Other (treat as last resort)
+  - Details:
-N/A, but it'd be useful to have the kubelet publish whether or not it is running rootless, as a boolean metric.
+###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
+
+
+
+99.9% of /health requests per day finish with 200 code
+
+###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
+
+
+
+- [ ] Metrics
+  - Metric name:
+  - [Optional] Aggregation method:
+  - Components exposing the metric:
+- [X] Other (treat as last resort)
+  - Details: Use `systemctl --user is-system-running` to verify whether the processes (RootlessKit, kubelet, kube-proxy, and CRI) are running.
+
+###### Are there any missing metrics that would be useful to have to improve observability of this feature?
+
+
+
+No
 ### Dependencies
+
+
 - Kernel: 5.2 or later is recommended. At least 4.15 or later is required. ([Reason](https://github.com/opencontainers/runc/blob/master/docs/cgroup-v2.md#host-requirements))
 - Systemd: 244 or later is recommended.
 - CRI: containerd >= 1.4, or CRI-O >= 1.22 is required.
 - OCI: runc >= 1.0-rc91 is required. runc >= 1.0-rc93 is recommended. crun works, too.
-_This section must be completed when targeting beta graduation to a release._
+###### Does this feature depend on any specific services running in the cluster?
+
+
-* **Does this feature depend on any specific services running in the cluster?**
 - [RootlessKit]
   - Usage description: sets up namespaces, and forwards incoming TCP & UDP packets
   - Impact of its outage on the feature: kubelet, kube-proxy, CRI, and all container processes will crash, and will be restarted by systemd.
@@ -732,58 +1026,146 @@ Both Docker and Podman use RootlessKit and slirp4netns (or VPNkit, optionally) i
 ### Scalability
-_For alpha, this section is encouraged: reviewers should consider these questions
-and attempt to answer them._
+
-_For GA, this section is required: approvers should be able to confirm the
-previous answers based on experience in the field._
+###### Will enabling / using this feature result in any new API calls?
-* **Will enabling / using this feature result in any new API calls?**
+
 No.
-
-* **Will enabling / using this feature result in introducing new API types?**
+
+###### Will enabling / using this feature result in introducing new API types?
+
+
 No.
-* **Will enabling / using this feature result in any new calls to the cloud
-provider?**
+###### Will enabling / using this feature result in any new calls to the cloud provider?
+
+
 No.
-* **Will enabling / using this feature result in increasing size or count of
-the existing API objects?**
+###### Will enabling / using this feature result in increasing size or count of the existing API objects?
+
+
 No.
-* **Will enabling / using this feature result in increasing time taken by any
-operations covered by [existing SLIs/SLOs]?**
+###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
+
+
 No.
-* **Will enabling / using this feature result in non-negligible increase of
-resource usage (CPU, RAM, disk, IO, ...) in any components?**
+###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
+
+
 RootlessKit and slirp4netns may face high CPU and memory consumption.
+###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+
+
+
+No
+
 ### Troubleshooting
+
-_This section must be completed when targeting beta graduation to a release._
-
-* **How does this feature react if the API server and/or etcd is unavailable?**
+###### How does this feature react if the API server and/or etcd is unavailable?
 Same as traditional rootful Kubernetes.
-* **What are other known failure modes?**
+###### What are other known failure modes?
+
+
 Same as traditional rootful Kubernetes.
+###### What steps should be taken if SLOs are not being met to determine the problem?
+
+- Make sure that the supported version of the components are used
+- [Make sure that more than 65536 subuids are allocated](https://rootlesscontaine.rs/getting-started/common/subuid/)
+- [Make sure that cgroup v2 delegation is enabled](https://rootlesscontaine.rs/getting-started/common/cgroup2/)
+
 ## Implementation History