Module: garden

Purpose

What?

This module provides chaostoolkit actions to simulate zone outages and disrupt pods in Gardener-managed clusters. It supports:

  • Compute: Termination or hard restart/reboot of nodes in one zone with a min/max lifetime (e.g. 0s-0s to shoot down any machine right when it tries to come up, or 10s-60s to let machines run for at least 10s but shoot them down after 60s at the latest).
  • Network: Blocking only ingress, only egress, or all network traffic for nodes in one zone.
  • Pods: Termination of control plane pods (depends on your access permissions - end users have no access), system component pods (Gardener-managed addons in your kube-system namespace), or pods in general in one zone, with a min/max lifetime (e.g. 0s-0s to shoot down any pod right when it tries to come up, or 10s-60s to let pods run for at least 10s but shoot them down after 60s at the latest) and with or without a grace period.

⚠️ If you block network traffic in only one direction, e.g. ingress (resp. egress), then the other direction, i.e. egress (resp. ingress), is fully opened, so use with care.

You can run the above in parallel, even of the same type, as long as the targeted zones differ. This way you can also test whether you recover after a multi-zonal outage.

This module also provides chaostoolkit probes:

  • Health Probe: Probes various Gardener-managed cluster functions in parallel. See k8s for details.

How?

  • Compute and Network: See cloud provider specific docs.
  • Pods: Based on the given zone and filters, pods are identified busily/continuously and terminated with or without a grace period. You may provide a min/max lifetime to make the process more random, chaotic, and unpredictable, which may further help you unearth issues.
  • Health Probe: Deploys probes into the cluster that busily/continuously probe various Gardener-managed cluster functions in parallel. This operation must be rolled back when completed.
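
The min/max lifetime described above can be read as "pick a random lifetime per target within the given bounds and terminate the target once it has lived that long". The following is a minimal sketch of that reading only. The helper names `pick_lifetime` and `is_due` are hypothetical and not part of chaosgarden, and the uniform distribution is an assumption.

```python
import random
from typing import Optional


def pick_lifetime(min_lifetime: float, max_lifetime: float,
                  rng: Optional[random.Random] = None) -> float:
    """Pick a lifetime in seconds between the two bounds (hypothetical helper)."""
    if min_lifetime < 0 or max_lifetime < min_lifetime:
        raise ValueError("expected 0 <= min_lifetime <= max_lifetime")
    source = rng if rng is not None else random
    # Assumption: lifetimes are drawn uniformly; 0s-0s always yields 0.
    return source.uniform(min_lifetime, max_lifetime)


def is_due(started_at: float, lifetime: float, now: float) -> bool:
    """Return True once a target has lived at least its chosen lifetime."""
    return now - started_at >= lifetime
```

With bounds of 0s-0s every target is due immediately, while 10s-60s guarantees at least 10s of uptime and termination after 60s at the latest.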

Why?

Developing highly available workloads that can tolerate a zone outage is no trivial task. You can find more information on how to achieve this goal here. This module helps you put your solution to the test.

The probe, on the other hand, targets Gardener developers and output qualification. It puts Gardener HA as such to the test, which requires automation because Gardener-managed clusters perform many functions in parallel.

Usage

Actions and Rollbacks

chaostoolkit introduces so-called actions that can be composed into experiments that perform operations against a system (here a Gardener-managed Kubernetes cluster). The following actions (and explicit rollbacks) are supported:

Module: chaosgarden.garden.actions

  • assess_cloud_provider_filters_impact: Show which machines/networks would be affected by the given zone and filters. Useful in combination with wait-for before launching the actual action.

  • run_cloud_provider_compute_failure_simulation: Run compute failure simulation.

  • run_cloud_provider_compute_failure_simulation_in_background: Same as above, but running in background as a thread. Normally not used with experiments, but directly in Python (scripts).

  • run_cloud_provider_network_failure_simulation: Run network failure simulation.

  • rollback_cloud_provider_network_failure_simulation: Rollback network failure simulation explicitly (usually performed automatically above, but can also be invoked explicitly as rollback step in an experiment to deal with interruptions).

  • run_cloud_provider_network_failure_simulation_in_background: Same as run_cloud_provider_network_failure_simulation, but running in background as a thread. Normally not used with experiments, but directly in Python (scripts).

  • run_control_plane_pod_failure_simulation: Run control plane pod failure simulation (depends on your access permissions - end users have no access).

  • run_control_plane_pod_failure_simulation_in_background: Same as above, but running in background as a thread. Normally not used with experiments, but directly in Python (scripts).

  • run_system_components_pod_failure_simulation: Run system component pod failure simulation (Gardener-managed addons in your kube-system namespace).

  • run_system_components_pod_failure_simulation_in_background: Same as above, but running in background as a thread. Normally not used with experiments, but directly in Python (scripts).

  • run_general_pod_failure_simulation: Run general pod failure simulation.

  • run_general_pod_failure_simulation_in_background: Same as above, but running in background as a thread. Normally not used with experiments, but directly in Python (scripts).

  • run_shoot_cluster_health_probe: Run shoot cluster health probe (usually only interesting to Gardener developers).

  • rollback_shoot_cluster_health_probe: Rollback shoot cluster health probe explicitly (usually performed automatically above, but can also be invoked explicitly as rollback step in an experiment to deal with interruptions).

  • run_shoot_cluster_health_probe_in_background: Same as run_shoot_cluster_health_probe, but running in background as a thread. Normally not used with experiments, but directly in Python (scripts).
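
As a rough illustration of how one of these actions and its explicit rollback might be wired into an experiment, the sketch below builds an experiment as a Python dictionary and serializes it to JSON. The module and function names are the ones listed above. The argument names `zone` and `duration`, their values, and the titles are assumptions made for illustration only, so check the actual function signatures before use.

```python
import json

MODULE = "chaosgarden.garden.actions"  # module name as listed above


def activity(name: str, func: str, arguments: dict) -> dict:
    """Build a chaostoolkit action that calls a Python function."""
    return {
        "type": "action",
        "name": name,
        "provider": {
            "type": "python",
            "module": MODULE,
            "func": func,
            "arguments": arguments,
        },
    }


# Hypothetical argument names and values, not taken from the module docs.
run_args = {"zone": 0, "duration": 60}
rollback_args = {"zone": 0}

experiment = {
    "title": "Zone outage: network failure",
    "description": "Block network traffic in one zone, then roll back.",
    "method": [
        activity("block-network",
                 "run_cloud_provider_network_failure_simulation", run_args),
    ],
    # Explicit rollback step to deal with interrupted runs.
    "rollbacks": [
        activity("unblock-network",
                 "rollback_cloud_provider_network_failure_simulation",
                 rollback_args),
    ],
}

experiment_json = json.dumps(experiment, indent=2)
```

Written to a file such as experiment.json, the result could then be passed to `chaos run experiment.json`.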

Pod Selectors

The following pod selectors are supported:

  • pod_node_label_selector, e.g. topology.kubernetes.io/zone=world-1a,worker.gardener.cloud/pool=cpu-worker,..., right-hand side may be a regex, operators are =|==|!=|=~|!~
  • pod_label_selector, e.g. gardener.cloud/role=controlplane,gardener.cloud/role=vpa,..., regular pod label selector (not interpreted by chaosgarden)
  • pod_metadata_selector, e.g. namespace=kube-system,name=kube-apiserver.*,..., right-hand side may be a regex, operators are =|==|!=|=~|!~
  • pod_owner_selector, e.g. kind!=DaemonSet,name=kube-apiserver.*,..., right-hand side may be a regex, operators are =|==|!=|=~|!~
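
To make the selector syntax more concrete, here is a minimal sketch of how such comma-separated terms could be parsed and evaluated. It is not chaosgarden's parser. It assumes that `=`, `==`, and `=~` all mean "the regex matches the whole value", that `!=` and `!~` negate that, that a missing attribute is treated as an empty string, and that the regex itself contains no commas.

```python
import re
from typing import Dict, List, Tuple

# key, operator (=, ==, !=, =~, !~), right-hand side
_TERM = re.compile(r"^\s*([^=!~\s]+)\s*(==|!=|=~|!~|=)\s*(.*?)\s*$")


def parse_selector(selector: str) -> List[Tuple[str, str, str]]:
    """Split a selector into (key, operator, pattern) terms."""
    terms = []
    for raw in selector.split(","):  # simplification: no commas in patterns
        if not raw.strip():
            continue
        match = _TERM.match(raw)
        if match is None:
            raise ValueError(f"invalid selector term: {raw!r}")
        terms.append((match.group(1), match.group(2), match.group(3)))
    return terms


def matches(selector: str, attributes: Dict[str, str]) -> bool:
    """Return True if all terms of the selector hold for the attributes."""
    for key, operator, pattern in parse_selector(selector):
        value = attributes.get(key, "")  # assumption: missing means empty
        hit = re.fullmatch(pattern, value) is not None
        if operator in ("!=", "!~"):
            hit = not hit
        if not hit:
            return False
    return True
```

For example, `matches("kind!=DaemonSet,name=kube-apiserver.*", {"kind": "ReplicaSet", "name": "kube-apiserver-abc"})` holds under these assumptions, while the same selector rejects a pod owned by a DaemonSet.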

Configuration

The following configuration fields are mandatory:

  • project: Gardener project name
  • shoot: Shoot cluster name

Secrets

The following secret field is optional:

  • kubeconfig_path: Path to kubeconfig file with Garden cluster configuration and credentials

You can omit this field if $KUBECONFIG points to your kubeconfig file (default).
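
The fallback described above can be pictured as follows. This is only a sketch of the lookup order: the helper name `resolve_kubeconfig_path` is hypothetical, and how chaosgarden actually resolves the path internally is not shown here.

```python
import os
from typing import Mapping, Optional


def resolve_kubeconfig_path(
        secrets: Optional[Mapping[str, str]] = None,
        environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Prefer the explicit secret field, else fall back to $KUBECONFIG."""
    if secrets and secrets.get("kubeconfig_path"):
        return secrets["kubeconfig_path"]
    # Returns None if neither the secret nor the variable is set.
    return environ.get("KUBECONFIG")
```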

Examples