Properly handle existing GCP firewall rules #348

HarrisonWAffel · 2025-09-16T15:29:25Z

Issue: rancher/rancher#51856

Problem

When provisioning GCE node driver clusters which have several machines (3+) in a given pool, the machine provisioning jobs may encounter a transient error when attempting to create internal or external firewall rules. This is due to a race condition between multiple provisioning jobs attempting to create a firewall rule using the same name at the same time.

rancher-machine is expected to first detect if a rule exists and use it if it does. In this case, the rule is being created by one pod in the short amount of time between the initial check and the creation attempt within another pod.

This error does not prevent GCE clusters from being provisioned successfully, but does result in some churn in the machine provision pods and GCE VM instances.

Solution

If the firewall rule already exists when creation is attempted, do not return an error; use the existing rule instead.
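
For readers unfamiliar with the pattern, here is a minimal sketch of tolerating an "already exists" result in a check-then-create sequence. It is not the rancher-machine code: `firewallAPI`, `fakeFirewalls`, `staleView`, `errAlreadyExists`, and `ensureFirewallRule` are hypothetical names, and the in-memory store stands in for the real GCE API, whose error shape is not shown in this description.

```go
package main

import (
	"errors"
	"sync"
)

// errAlreadyExists is a hypothetical stand-in for the provider's
// "resource already exists" conflict error.
var errAlreadyExists = errors.New("firewall rule already exists")

// firewallAPI is a hypothetical, minimal view of a firewall service.
type firewallAPI interface {
	Exists(name string) bool
	Create(name string) error
}

// fakeFirewalls is an in-memory firewallAPI used only for illustration.
type fakeFirewalls struct {
	mu    sync.Mutex
	rules map[string]bool
}

func newFakeFirewalls() *fakeFirewalls {
	return &fakeFirewalls{rules: make(map[string]bool)}
}

func (f *fakeFirewalls) Exists(name string) bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	return f.rules[name]
}

// Create rejects duplicate names, as a real provider would.
func (f *fakeFirewalls) Create(name string) error {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.rules[name] {
		return errAlreadyExists
	}
	f.rules[name] = true
	return nil
}

// count returns the number of stored rules.
func (f *fakeFirewalls) count() int {
	f.mu.Lock()
	defer f.mu.Unlock()
	return len(f.rules)
}

// staleView simulates a job whose existence check ran before another job's
// create completed: Exists always reports false, Create hits the real store.
type staleView struct{ firewallAPI }

func (staleView) Exists(string) bool { return false }

// ensureFirewallRule creates the rule if it is missing. Losing the race to
// another job (errAlreadyExists) is treated as success, because the rule
// that now exists is the one we wanted.
func ensureFirewallRule(api firewallAPI, name string) error {
	if api.Exists(name) {
		return nil
	}
	err := api.Create(name)
	if errors.Is(err, errAlreadyExists) {
		return nil
	}
	return err
}
```

Calling `ensureFirewallRule(staleView{firewallAPI: api}, name)` after another caller has already created `name` reproduces the losing side of the race and still returns `nil`. Any other error from `Create` is returned unchanged.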

This PR also fixes a flaky GCE test that indexed into a slice populated from a map. Because Go map iteration order is not deterministic, the slice elements were occasionally misordered.
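
One common way to remove this kind of flakiness is to sort the slice before indexing into it. The sketch below illustrates that idea only; it is not necessarily the change made in `compute_util_test.go`, and `sortedKeys` and the sample port strings are hypothetical.

```go
package main

import "sort"

// sortedKeys collects the keys of m into a slice and sorts them, so that
// callers can index into the result without depending on Go's unspecified
// map iteration order.
func sortedKeys(m map[string]struct{}) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}
```

With this helper, a test can safely assert on `sortedKeys(m)[0]`. Without the sort, the same assertion could pass or fail from run to run.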

Testing

I've built a custom version of rancher-machine and used it in a development environment.

Provision a single-pool GCE cluster with three all-role nodes.
Compare the logs shown in each machine provision job and confirm that the relevant log messages are shown.
Confirm that no error is seen in the Rancher UI and that the cluster becomes available.

I did this for both the internal firewall rule and the external rule, the latter by setting external ports.

Additional Information

This error does not reliably reproduce in resource-constrained environments (e.g. 2 CPUs, 4 GB RAM), as all three jobs may not start at the same time. This is likely why it was not seen during development of the GCE UI or during later validation. To reproduce it, host the Rancher server on a larger VM so that all jobs can start together and race.

jiaqiluo

LGTM.

fix: do not return an error if firewall rules already exist

421fff0

HarrisonWAffel requested a review from a team September 16, 2025 15:29

fix: update compute_util_test.go to be less flakey

88ce46a

jakefhyde approved these changes Sep 22, 2025

jakefhyde requested a review from a team September 22, 2025 20:53

jiaqiluo approved these changes Sep 23, 2025

HarrisonWAffel merged commit 1e6a7eb into rancher:master Sep 24, 2025
1 check passed

HarrisonWAffel deleted the update-gce-labels branch September 24, 2025 15:53

This was referenced Sep 24, 2025

[v2.12] Bump rancher machine v0.15.0 rancher133 rancher/rancher#52084

Merged

[main] Bump rancher-machine to v0.15.0-rancher133 rancher/rancher#52085

Merged
