K8SPS-357: Improve full cluster crash recovery #928

Open · wants to merge 2 commits into main

Conversation

egegunes
Contributor

@egegunes egegunes commented Jun 3, 2025

K8SPS-357

CHANGE DESCRIPTION

Our full cluster crash recovery procedure requires at least 1 restart on the primary and 3 restarts on the secondaries:

  1. Cluster started after crash
  2. Pods are started
  3. Full cluster crash detected (1st restart)
  4. Operator reboots the cluster
  5. Secondary pods are restarted to join the cluster (2nd restart)
  6. Secondary pods receive data with Clone (3rd restart)

Even though these restarts are by design, they give the impression that something is wrong with the cluster.

These changes aim to reduce the restarts to 1. After a successful crash recovery, the operator deletes all secondary pods so they can join the cluster; the only remaining restart is the 3rd one, required after the clone. Secondary pods are deleted on a best-effort basis: if they cannot be deleted, the operator takes no further action, and in that case the secondary pods should still be ready to serve traffic after 3-4 restarts.
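The best-effort deletion described above can be sketched as follows. This is an illustrative shell sketch, not the operator's actual Go implementation; `delete_secondaries`, its `delete_cmd` parameter, and the pod names are hypothetical stand-ins (in the real operator this is done through the Kubernetes API):

```shell
# Best-effort deletion sketch: attempt every secondary, ignore failures,
# and never propagate an error back to the caller.
delete_secondaries() {
  delete_cmd="$1"; shift
  for pod in "$@"; do
    if $delete_cmd "$pod" 2>/dev/null; then
      echo "deleted $pod"
    else
      # Best effort: a failed delete is logged and ignored; the pod
      # simply goes through the extra restarts instead.
      echo "could not delete $pod, leaving it to restart on its own"
    fi
  done
  return 0
}
```

With something like `kubectl delete pod` as the delete command, failures are swallowed the same way: the reconcile never fails because of an undeletable secondary.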


To recover a cluster from a full cluster crash, we use `dba.rebootClusterFromCompleteOutage` in mysql-shell. This command connects to each MySQL pod to find the node with the latest transactions and reboots the cluster from it. This means mysqld needs to be up and running during crash recovery.
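The ordering constraint above can be sketched roughly as follows; `check_cmd` and `reboot_cmd` are hypothetical stand-ins for the real per-pod connectivity check and the mysqlsh invocation:

```shell
# Sketch: the reboot can only proceed once mysqld is reachable on every
# pod, because the command inspects each node to find the latest GTIDs.
reboot_cluster() {
  check_cmd="$1" reboot_cmd="$2"; shift 2
  for pod in "$@"; do
    $check_cmd "$pod" || return 1  # every mysqld must be up first
  done
  # e.g. mysqlsh -e 'dba.rebootClusterFromCompleteOutage()' in the real flow
  $reboot_cmd
}
```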

After these changes, pods are marked ready only if the MySQL state recorded in `$MYSQL_STATE_FILE` is ready.
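A minimal sketch of such a readiness check, assuming the state file simply contains the string `ready` (the actual file format and probe wiring belong to the operator and are not shown here):

```shell
# Hypothetical readiness check: the pod reports ready only when the
# state file written by the MySQL container says "ready".
mysql_ready() {
  state="$(cat "${MYSQL_STATE_FILE:?}" 2>/dev/null)"
  [ "$state" = "ready" ]
}
```

Wired into an exec readiness probe, this keeps a pod out of service endpoints while mysqld is still recovering or cloning.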


This commit also introduces more events in PerconaServerMySQL:

```
Events:
  Type     Reason                     Age                     From           Message
  ----     ------                     ----                    ----           -------
  Warning  ClusterStateChanged        6m33s                   ps-controller  -> Initializing
  Warning  ClusterStateChanged        5m10s                   ps-controller  Initializing -> Error
  Warning  FullClusterCrashDetected   3m32s (x23 over 5m10s)  ps-controller  Full cluster crash detected
  Normal   FullClusterCrashRecovered  2m40s                   ps-controller  Cluster recovered from full cluster crash
  Warning  ClusterStateChanged        2s                      ps-controller  Initializing -> Ready
```

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported PS version?
  • Does the change support oldest and newest supported Kubernetes version?

@pull-request-size pull-request-size bot added the size/L 100-499 lines label Jun 3, 2025
@egegunes egegunes added this to the v0.11.0 milestone Jun 3, 2025
@JNKPercona
Collaborator

Test name Status
version-service passed
async-ignore-annotations passed
auto-config passed
config passed
config-router passed
demand-backup passed
gr-demand-backup passed
gr-demand-backup-haproxy passed
gr-finalizer passed
gr-haproxy passed
gr-ignore-annotations passed
gr-init-deploy passed
gr-one-pod passed
gr-recreate passed
gr-scaling passed
gr-scheduled-backup passed
gr-security-context passed
gr-self-healing failure
gr-tls-cert-manager passed
gr-users passed
haproxy passed
init-deploy passed
limits passed
monitoring passed
one-pod passed
operator-self-healing passed
recreate passed
scaling passed
scheduled-backup passed
service-per-pod passed
sidecars passed
smart-update passed
tls-cert-manager passed
users passed
We ran 34 out of 34 tests.

commit: c8d26aa
image: perconalab/percona-server-mysql-operator:PR-928-c8d26aaf

@egegunes egegunes marked this pull request as ready for review June 5, 2025 10:00