MD.Status.ReadyReplicas changes from 3 to 0 when machineset_controller updateStatus() hits "Unable to retrieve Node status" error #10195

### What steps did you take and what happened?

When calling clusterctlclient.Client.ApplyUpgrade(upgrade) to upgrade the CAPI core components (version unchanged) and a CAPI infrastructure provider component (version changed), there is a very low probability that the capi-controller-manager pod restarts. Both the current and the previous log of the capi-controller-manager pod contain the error "Unable to retrieve Node status":

> E0223 18:31:51.557569 1 machineset_controller.go:883] **"Unable to retrieve Node status" err="failed to create cluster accessor: failed to get lock for cluster: cluster is locked already"** controller="machineset" controllerGroup="cluster.x-k8s.io" controllerKind="MachineSet" MachineSet="e2e-mycluster-v1-24-106-sks-upgrade/e2e-50a8u5-sks-upgrade-3m3w-workers-bkzcs" namespace="e2e-mycluster-v1-24-106-sks-upgrade" name="e2e-50a8u5-sks-upgrade-3m3w-workers-bkzcs" reconcileID=b9a3b2d2-00e9-4d0f-97b4-f2448292404d MachineDeployment="e2e-mycluster-v1-24-106-sks-upgrade/e2e-50a8u5-sks-upgrade-3m3w-workers" Cluster="e2e-mycluster-v1-24-106-sks-upgrade/e2e-50a8u5-sks-upgrade-3m3w" Machine="e2e-mycluster-v1-24-106-sks-upgrade/e2e-50a8u5-sks-upgrade-3m3w-workers-bkzcs-75tm4" node=""

This error causes MD.Status.ReadyReplicas to change from 3 to 0; after about 90s it changes back to 3. The reason is that updateStatus() in [machineset_controller.go](https://github.com/kubernetes-sigs/cluster-api/blob/v1.5.2/internal/controllers/machineset/machineset_controller.go#L863-L884) ignores the error returned by getMachineNode() and treats the Node as not ready. In the meantime, KCP.Status.ReadyReplicas changes from 3 to 2 and back to 3 (after only about 8 seconds); the reason might be that [KCP Reconcile()](https://github.com/kubernetes-sigs/cluster-api/blob/v1.5.6/controlplane/kubeadm/internal/controllers/controller.go#L207-L264) requeues immediately after hitting the ErrClusterLocked error.
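To make the failure mode concrete, here is a minimal, self-contained sketch of the counting pattern described above. It is not the actual CAPI code: `machine`, `nodeGetter`, `countReadyLenient`, and `errClusterLocked` are hypothetical stand-ins for the Machine type, getMachineNode(), the ready-replica loop in updateStatus(), and ErrClusterLocked.

```go
package main

import (
	"errors"
	"fmt"
)

// errClusterLocked is a hypothetical stand-in for the ErrClusterLocked error.
var errClusterLocked = errors.New("cluster is locked already")

// machine is a hypothetical, stripped-down stand-in for a CAPI Machine.
type machine struct {
	name string
}

// nodeGetter is a hypothetical stand-in for getMachineNode(): it reports
// whether the Machine's Node is ready, or an error if the Node could not be read.
type nodeGetter func(m machine) (ready bool, err error)

// countReadyLenient follows the pattern described in the issue: a lookup
// error is only logged and the Machine is skipped, so it is never counted
// as ready even if its Node is actually healthy.
func countReadyLenient(machines []machine, getNode nodeGetter) int {
	ready := 0
	for _, m := range machines {
		ok, err := getNode(m)
		if err != nil {
			fmt.Printf("Unable to retrieve Node status for %s: %v\n", m.name, err)
			continue
		}
		if ok {
			ready++
		}
	}
	return ready
}

func main() {
	machines := []machine{{name: "w-1"}, {name: "w-2"}, {name: "w-3"}}

	healthy := func(machine) (bool, error) { return true, nil }
	locked := func(machine) (bool, error) { return false, errClusterLocked }

	fmt.Println("ready replicas, normal reconcile:", countReadyLenient(machines, healthy))
	fmt.Println("ready replicas, cluster locked:  ", countReadyLenient(machines, locked))
}
```

In this sketch a transient lookup failure for every Machine yields a count of 0, which is indistinguishable from all Nodes being genuinely not ready.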

Our code on top of CAPI watches MD.Status.ReadyReplicas, and the drop from 3 to 0 causes an issue on our side.

### What did you expect to happen?

- MD.Status.ReadyReplicas should not change from 3 to 0 when hitting ErrClusterLocked (at least), and ideally not on other errors either, because the Nodes are actually ready.
- KCP.Status.ReadyReplicas should not change either when hitting the ErrClusterLocked error.

### Cluster API version

1.5.2

### Kubernetes version

1.24.17

### Anything else you would like to add?

To avoid the MD.Status.ReadyReplicas change in this case, we can `return error` rather than `continue` at 
https://github.com/kubernetes-sigs/cluster-api/blob/v1.5.2/internal/controllers/machineset/machineset_controller.go#L882-L884 when the error is ErrClusterLocked (or even return the error on any error).

### Label(s) to be applied

/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

