Properly handle existing GCP firewall rules #348
Merged
Issue: rancher/rancher#51856
Problem
When provisioning GCE node driver clusters that have several machines (3+) in a given pool, the machine provisioning jobs may encounter a transient error when attempting to create internal or external firewall rules. This is caused by a race condition: multiple provisioning jobs attempt to create a firewall rule with the same name at the same time.
rancher-machine is expected to first check whether a rule exists and, if it does, use it. In this case, one pod creates the rule in the short window between another pod's initial check and its creation attempt, so the second creation attempt fails.
This error does not prevent GCE clusters from being provisioned successfully, but it does cause some churn in the machine provisioning pods and GCE VM instances.
Solution
If the firewall rule already exists by the time the creation attempt is made, do not return an error; use the existing firewall rule instead.
This PR also fixes a flaky GCE test that indexed into a slice populated by iterating over a map. Because Go does not guarantee map iteration order, the slice was occasionally misordered.
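One common way to make such a test deterministic is to sort the slice after building it from the map. The sketch below shows that general technique only; `sortedKeys` is a hypothetical helper, and the actual test fix in this PR may differ.

```go
package main

import "sort"

// sortedKeys returns the keys of m in ascending order. Map iteration
// order in Go is not guaranteed, so the keys are sorted before any
// caller indexes into the result.
func sortedKeys(m map[string]int) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}
```

A test can then safely assert on `sortedKeys(m)[0]`, or compare contents without relying on position at all.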
Testing
I've built a custom version of rancher-machine and used it in a development environment.
I tested both the internal firewall rule and the external rule, the latter by setting external ports.
Additional Information
This error does not reliably reproduce in resource-constrained environments (e.g. 2 CPUs, 4 GB RAM), as all three jobs may not start at the same time. This is likely why it was not seen during development of the GCE UI or during later validation. To reproduce it, use larger VMs to host the Rancher server so that all jobs start together and can race.