K8SPS-357: Improve full cluster crash recovery #928

Open · wants to merge 2 commits into main

Conversation

egegunes
Contributor

@egegunes egegunes commented Jun 3, 2025

K8SPS-357

CHANGE DESCRIPTION

Our full cluster crash recovery procedure requires at least 1 restart on the primary and 3 restarts on the secondaries:

  1. Cluster started after crash
  2. Pods are started
  3. Full cluster crash detected (1st restart)
  4. Operator reboots the cluster
  5. Secondary pods are restarted to join the cluster (2nd restart)
  6. Secondary pods receive data with Clone (3rd restart)

Even though these restarts are by design, they give the impression that something is wrong with the cluster.

These changes aim to reduce the restarts to 1. After a successful crash recovery, the operator deletes all secondary pods so they can join the cluster; the only remaining restart is the 3rd one, required after the clone. Secondary pods are deleted on a best-effort basis: if they cannot be deleted, the operator takes no further action, and in that case the secondary pods should still be ready to serve traffic after 3-4 restarts.
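The best-effort deletion described above can be sketched as follows. This is an illustrative shell sketch, not the operator's actual Go implementation; `delete_secondaries`, its `delete_cmd` parameter, and the pod names are hypothetical stand-ins (in the real operator this is done through the Kubernetes API):

```shell
# Best-effort deletion sketch: attempt every secondary, ignore failures,
# and never propagate an error back to the caller.
delete_secondaries() {
  delete_cmd="$1"; shift
  for pod in "$@"; do
    if $delete_cmd "$pod" 2>/dev/null; then
      echo "deleted $pod"
    else
      # Best effort: a failed delete is logged and ignored; the pod
      # simply goes through the extra restarts instead.
      echo "could not delete $pod, leaving it to restart on its own"
    fi
  done
  return 0
}
```

With something like `kubectl delete pod` as the delete command, failures are swallowed the same way: the reconcile never fails because of an undeletable secondary.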


To recover a cluster from a full cluster crash, we use `dba.rebootClusterFromCompleteOutage` in mysql-shell. This command connects to each MySQL pod to find the node with the latest transactions and reboots the cluster from it. This means mysqld needs to be up and running during crash recovery.
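The ordering constraint above can be sketched roughly as follows; `check_cmd` and `reboot_cmd` are hypothetical stand-ins for the real per-pod connectivity check and the mysqlsh invocation:

```shell
# Sketch: the reboot can only proceed once mysqld is reachable on every
# pod, because the command inspects each node to find the latest GTIDs.
reboot_cluster() {
  check_cmd="$1" reboot_cmd="$2"; shift 2
  for pod in "$@"; do
    $check_cmd "$pod" || return 1  # every mysqld must be up first
  done
  # e.g. mysqlsh -e 'dba.rebootClusterFromCompleteOutage()' in the real flow
  $reboot_cmd
}
```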

After these changes, pods are marked ready only if the MySQL state recorded in `$MYSQL_STATE_FILE` is ready.
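A minimal sketch of such a readiness check, assuming the state file simply contains the string `ready` (the actual file format and probe wiring belong to the operator and are not shown here):

```shell
# Hypothetical readiness check: the pod reports ready only when the
# state file written by the MySQL container says "ready".
mysql_ready() {
  state="$(cat "${MYSQL_STATE_FILE:?}" 2>/dev/null)"
  [ "$state" = "ready" ]
}
```

Wired into an exec readiness probe, this keeps a pod out of service endpoints while mysqld is still recovering or cloning.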


This commit also introduces more events in PerconaServerMySQL:

```
Events:
  Type     Reason                     Age                     From           Message
  ----     ------                     ----                    ----           -------
  Warning  ClusterStateChanged        6m33s                   ps-controller  -> Initializing
  Warning  ClusterStateChanged        5m10s                   ps-controller  Initializing -> Error
  Warning  FullClusterCrashDetected   3m32s (x23 over 5m10s)  ps-controller  Full cluster crash detected
  Normal   FullClusterCrashRecovered  2m40s                   ps-controller  Cluster recovered from full cluster crash
  Warning  ClusterStateChanged        2s                      ps-controller  Initializing -> Ready
```

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported PS version?
  • Does the change support oldest and newest supported Kubernetes version?

@pull-request-size pull-request-size bot added the size/L 100-499 lines label Jun 3, 2025
@egegunes egegunes added this to the v0.11.0 milestone Jun 3, 2025
@JNKPercona
Collaborator

Test name Status
version-service passed
async-ignore-annotations passed
auto-config passed
config passed
config-router passed
demand-backup passed
gr-demand-backup passed
gr-demand-backup-haproxy passed
gr-finalizer passed
gr-haproxy passed
gr-ignore-annotations passed
gr-init-deploy passed
gr-one-pod passed
gr-recreate passed
gr-scaling passed
gr-scheduled-backup passed
gr-security-context passed
gr-self-healing failure
gr-tls-cert-manager passed
gr-users passed
haproxy passed
init-deploy passed
limits passed
monitoring passed
one-pod passed
operator-self-healing passed
recreate passed
scaling passed
scheduled-backup passed
service-per-pod passed
sidecars passed
smart-update passed
tls-cert-manager passed
users passed
We ran 34 out of 34 tests.

commit: c8d26aa
image: perconalab/percona-server-mysql-operator:PR-928-c8d26aaf

@egegunes egegunes marked this pull request as ready for review June 5, 2025 10:00