This module provides chaostoolkit actions to simulate zone outages and disrupt pods in Gardener-managed clusters. It supports:
- Compute: Termination or hard restart/reboot of nodes in one zone with a min/max lifetime (e.g. 0s-0s to shoot down any machine right when it tries to come up or e.g. 10-60s to let them come up at least for 10s but shoot them down at the latest after 60s).
- Network: Blocking only ingress or only egress or all network traffic for nodes in one zone.
- Pods: Termination of control plane pods (depends on your access permissions - end users have no access), system component pods (Gardener-managed addons in your `kube-system` namespace), or pods in general in one zone with a min/max lifetime (e.g. 0s-0s to shoot down any pod right when it tries to come up or e.g. 10-60s to let them come up at least for 10s but shoot them down at the latest after 60s) with or without a grace period.
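
The min/max lifetime idea can be sketched in a few lines of Python. This is only an illustration of the concept, not the module's implementation; the helper names `pick_lifetime` and `is_due` are hypothetical, and a uniform random draw between the two bounds is an assumption.

```python
import random
from datetime import datetime, timedelta, timezone


def pick_lifetime(min_runtime: float, max_runtime: float, rng: random.Random) -> timedelta:
    """Draw a lifetime in seconds from the range [min_runtime, max_runtime].

    Assumption: the draw is uniform; the real module may choose differently.
    """
    if min_runtime < 0 or max_runtime < min_runtime:
        raise ValueError("expected 0 <= min_runtime <= max_runtime")
    return timedelta(seconds=rng.uniform(min_runtime, max_runtime))


def is_due(created_at: datetime, lifetime: timedelta, now: datetime) -> bool:
    """A machine or pod is due for termination once it has outlived its lifetime."""
    return now - created_at >= lifetime


rng = random.Random(42)  # seeded only to make the sketch repeatable
created = datetime(2024, 1, 1, tzinfo=timezone.utc)

# 0s-0s: the lifetime is always zero, so the target is due as soon as it appears.
immediate = pick_lifetime(0, 0, rng)

# 10-60s: the target survives at least 10s and at most 60s.
bounded = pick_lifetime(10, 60, rng)
```

With `0`/`0` the target is due immediately; with `10`/`60` it is never due before 10s and always due by 60s, which matches the two examples above.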
You can run the above in parallel, even of the same type, as long as the targeted zones differ. This way you can also test whether you recover after a multi-zonal outage.
This module also provides chaostoolkit probes:
- Health Probe: Probes various Gardener-managed cluster functions in parallel. See `k8s` for details.
- Compute and Network: See cloud provider specific docs.
- Pods: Based on the given zone and filters, pods are identified busily/continuously and terminated with or without a grace period. You may provide a min/max lifetime to make the process more random, chaotic, and unpredictable, which may further help you unearth issues.
- Health Probe: Deploys probes into the cluster that busily/continuously probe various Gardener-managed cluster functions in parallel. This operation must be rolled back when completed.
Developing a highly available workload that can tolerate a zone outage is no trivial task. You can find more information on how to achieve this goal here. This module helps you put your solution to the test.
The probe, on the other hand, targets Gardener developers and output-qualification. It puts Gardener HA as such to the test, which requires automation because Gardener-managed clusters perform many functions in parallel.
chaostoolkit introduces so-called actions that can be composed into experiments that perform operations against a system (here a Gardener-managed Kubernetes cluster). The following actions (and explicit rollbacks) are supported:
Module: `chaosgarden.garden.actions`
- `assess_cloud_provider_filters_impact`: Show which machines/networks would be affected by the given zone and filters. Useful in combination with `wait-for` before launching the actual action.
- `run_cloud_provider_compute_failure_simulation`: Run compute failure simulation.
- `run_cloud_provider_compute_failure_simulation_in_background`: Same as above, but running in background as a thread. Normally not used with experiments, but directly in Python (scripts).
- `run_cloud_provider_network_failure_simulation`: Run network failure simulation.
- `rollback_cloud_provider_network_failure_simulation`: Rollback network failure simulation explicitly (usually performed automatically above, but can also be invoked explicitly as rollback step in an experiment to deal with interruptions).
- `run_cloud_provider_network_failure_simulation_in_background`: Same as above, but running in background as a thread. Normally not used with experiments, but directly in Python (scripts).
- `run_control_plane_pod_failure_simulation`: Run control plane pod failure simulation (depends on your access permissions - end users have no access).
- `run_control_plane_pod_failure_simulation_in_background`: Same as above, but running in background as a thread. Normally not used with experiments, but directly in Python (scripts).
- `run_system_components_pod_failure_simulation`: Run system component pod failure simulation (Gardener-managed addons in your `kube-system` namespace).
- `run_system_components_pod_failure_simulation_in_background`: Same as above, but running in background as a thread. Normally not used with experiments, but directly in Python (scripts).
- `run_general_pod_failure_simulation`: Run general pod failure simulation.
- `run_general_pod_failure_simulation_in_background`: Same as above, but running in background as a thread. Normally not used with experiments, but directly in Python (scripts).
- `run_shoot_cluster_health_probe`: Run shoot cluster health probe (usually only interesting to Gardener developers).
- `rollback_shoot_cluster_health_probe`: Rollback shoot cluster health probe explicitly (usually performed automatically above, but can also be invoked explicitly as rollback step in an experiment to deal with interruptions).
- `run_shoot_cluster_health_probe_in_background`: Same as above, but running in background as a thread. Normally not used with experiments, but directly in Python (scripts).
The following pod selectors are supported:
- `pod_node_label_selector`, e.g. `topology.kubernetes.io/zone=world-1a,worker.gardener.cloud/pool=cpu-worker,...`, right-hand side may be a regex, operators are `=|==|!=|=~|!~`
- `pod_label_selector`, e.g. `gardener.cloud/role=controlplane,gardener.cloud/role=vpa,...`, regular pod label selector (not interpreted by `chaosgarden`)
- `pod_metadata_selector`, e.g. `namespace=kube-system,name=kube-apiserver.*,...`, right-hand side may be a regex, operators are `=|==|!=|=~|!~`
- `pod_owner_selector`, e.g. `kind!=DaemonSet,name=kube-apiserver.*,...`, right-hand side may be a regex, operators are `=|==|!=|=~|!~`
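
To make the operator syntax more concrete, here is a minimal Python sketch of how such a selector string could be parsed and evaluated. It is not the parser used by `chaosgarden`; the names `parse_selector` and `matches` are hypothetical, and the sketch assumes that terms are comma-separated and AND-ed, that the right-hand side is always treated as a regex that must match the whole value, and that `=`, `==` and `=~` select matches while `!=` and `!~` select non-matches.

```python
import re
from typing import Dict, List, Tuple

# key, operator (longest alternatives first), value; surrounding whitespace is ignored
TERM = re.compile(r"^\s*([^=!~\s]+)\s*(==|=~|!=|!~|=)\s*(.*?)\s*$")
NEGATING = {"!=", "!~"}


def parse_selector(selector: str) -> List[Tuple[str, str, re.Pattern]]:
    """Split a selector into (key, operator, compiled regex) terms."""
    terms = []
    for raw in selector.split(","):
        if not raw.strip():
            continue  # tolerate empty terms such as a trailing comma
        m = TERM.match(raw)
        if m is None:
            raise ValueError(f"cannot parse selector term: {raw!r}")
        key, op, value = m.groups()
        terms.append((key, op, re.compile(value)))
    return terms


def matches(selector: str, fields: Dict[str, str]) -> bool:
    """Return True if all terms of the selector hold for the given fields."""
    for key, op, pattern in parse_selector(selector):
        hit = key in fields and pattern.fullmatch(fields[key]) is not None
        if hit == (op in NEGATING):
            return False  # positive term missed, or negative term hit
    return True


pod = {"namespace": "kube-system", "name": "kube-apiserver-7c9d"}
owner = {"kind": "ReplicaSet", "name": "kube-apiserver-7c9d"}
selected = matches("namespace=kube-system,name=kube-apiserver.*", pod)
not_daemonset = matches("kind!=DaemonSet,name=kube-apiserver.*", owner)
```

One limitation of this sketch: because it splits on commas first, a regex that itself contains a comma would be cut apart.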
The following configuration fields are mandatory:
- `project`: Gardener project name
- `shoot`: Shoot cluster name
The following secret field is optional:
- `kubeconfig_path`: Path to `kubeconfig` file with Garden cluster configuration and credentials
You can omit this field if $KUBECONFIG points to your kubeconfig file (default).
- Run Shoot Cluster Health Probe as Hypothesis (doesn't really fit as it must run in background, which is not supported by `chaostoolkit`)
- Run Shoot Cluster Health Probe as Method (the better alternative and almost identical in `chaostoolkit` behavior)
- Explicit Garden Secrets (if you do not want to use `$KUBECONFIG`)
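
As a rough orientation for the "as Method" variant, the following Python sketch assembles an experiment skeleton and prints it as JSON. The overall layout (`title`, `description`, `configuration`, `method`, `rollbacks`, and the `python` provider with `module`/`func`/`arguments`) follows the general `chaostoolkit` experiment format. Everything else is an assumption: the `activity` helper and the activity names are hypothetical, the project and shoot values are placeholders, the exact nesting of the `project` and `shoot` configuration keys should be checked against the examples listed above, and `arguments` is left empty because the argument names are not listed in this section. No secrets are declared, which relies on `$KUBECONFIG` pointing to your kubeconfig file.

```python
import json

MODULE = "chaosgarden.garden.actions"  # module name as given in this document


def activity(name: str, func: str, arguments: dict = None) -> dict:
    """Build one chaostoolkit action that calls a Python function (hypothetical helper)."""
    return {
        "type": "action",
        "name": name,
        "provider": {
            "type": "python",
            "module": MODULE,
            "func": func,
            # Assumption: real experiments need arguments here; see the linked examples.
            "arguments": arguments or {},
        },
    }


experiment = {
    "title": "Shoot cluster health probe as method",
    "description": "Sketch only; values below are placeholders.",
    # Assumption: flat configuration keys named as in this document.
    "configuration": {"project": "my-project", "shoot": "my-shoot"},
    "method": [
        activity("run-health-probe", "run_shoot_cluster_health_probe"),
    ],
    # Explicit rollback, so that an interrupted run still cleans up the probes.
    "rollbacks": [
        activity("rollback-health-probe", "rollback_shoot_cluster_health_probe"),
    ],
}

print(json.dumps(experiment, indent=2))
```

Keeping the rollback as an explicit step mirrors the note above that the health probe operation must be rolled back when completed.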