@HarrisonWAffel HarrisonWAffel commented Sep 16, 2025

Issue: rancher/rancher#51856

Problem

When provisioning GCE node driver clusters which have several machines (3+) in a given pool, the machine provisioning jobs may encounter a transient error when attempting to create internal or external firewall rules. This is due to a race condition between multiple provisioning jobs attempting to create a firewall rule using the same name at the same time.

rancher-machine is expected to first check whether a rule exists and reuse it if it does. In this case, one pod creates the rule in the short window between another pod's initial check and that pod's creation attempt.

This error does not prevent GCE clusters from being provisioned successfully, but does result in some churn in the machine provision pods and GCE VM instances.

Solution

When the creation attempt fails because the firewall rule already exists, do not return an error; use the existing rule instead.
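The check-then-create flow with a tolerated conflict can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `firewallAPI`, `errAlreadyExists`, `ensureFirewallRule`, and `lostRaceAPI` are all hypothetical names, and how the real GCE client reports the conflict is assumed rather than shown.

```go
package main

import "errors"

// errAlreadyExists is a hypothetical sentinel standing in for however the
// cloud API reports that a rule with the same name already exists.
var errAlreadyExists = errors.New("firewall rule already exists")

// firewallAPI is a hypothetical, minimal view of a firewall service.
type firewallAPI interface {
	Get(name string) (exists bool, err error)
	Create(name string) error
}

// ensureFirewallRule makes sure a rule with the given name exists.
// Losing the creation race to another provisioning job is treated as success.
func ensureFirewallRule(api firewallAPI, name string) error {
	exists, err := api.Get(name)
	if err != nil {
		return err
	}
	if exists {
		return nil // the rule is already there, reuse it
	}
	if err := api.Create(name); err != nil {
		if errors.Is(err, errAlreadyExists) {
			// Another job created the rule between our check and our
			// create attempt; use the existing rule instead of failing.
			return nil
		}
		return err
	}
	return nil
}

// lostRaceAPI simulates the losing side of the race: the rule is absent
// at check time but present by the time Create is called.
type lostRaceAPI struct{ createCalls int }

func (a *lostRaceAPI) Get(string) (bool, error) { return false, nil }

func (a *lostRaceAPI) Create(string) error {
	a.createCalls++
	return errAlreadyExists
}
```

Calling `ensureFirewallRule(&lostRaceAPI{}, "example-rule")` returns nil even though the create call reports a conflict, which is the behavior described above. Any other creation error is still returned to the caller.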

This PR also fixes a flaky GCE test that indexed into a slice populated by iterating over a map. Because Go map iteration order is not deterministic, the slice elements were occasionally in an unexpected order.
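One common way to make such a test deterministic is to sort the values collected from the map before indexing into them. The sketch below shows that general technique only; `sortedKeys` is a hypothetical helper and is not necessarily how the test was fixed in this PR.

```go
package main

import "sort"

// sortedKeys returns the keys of m in ascending order, so that callers can
// index into the result without depending on Go's randomized map iteration.
func sortedKeys(m map[string]string) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}
```

With this helper, `sortedKeys(m)[0]` refers to the same key on every run. An alternative is to have the test assert on set membership rather than on position.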

Testing

I've built a custom version of rancher-machine and used it in a development environment.

  • provision a single-pool GCE cluster with three all-role nodes.
  • compare the logs shown in each machine provision job and confirm that the relevant log messages are shown
  • confirm that no error is seen in the Rancher UI, and that the cluster becomes available.

I did this for both the internal firewall rule and the external rule (the latter by setting external ports).

Additional Information

This error does not reliably reproduce in resource-constrained environments (e.g. 2 CPUs, 4 GB RAM), as all three jobs may not start at the same time. This is likely why it was not seen during development of the GCE UI or during later validation. To reproduce it, host the Rancher server on a larger VM so that all jobs can start together and race.

@HarrisonWAffel HarrisonWAffel requested a review from a team September 16, 2025 15:29
@jakefhyde jakefhyde requested a review from a team September 22, 2025 20:53

@jiaqiluo jiaqiluo left a comment

LGTM.

@HarrisonWAffel HarrisonWAffel merged commit 1e6a7eb into rancher:master Sep 24, 2025
1 check passed
@HarrisonWAffel HarrisonWAffel deleted the update-gce-labels branch September 24, 2025 15:53