Make the apiversion-upgrade management cluster HA #6329

Open
mboersma wants to merge 1 commit into kubernetes-sigs:main from mboersma:ha-upgrade-mgmt-cluster

@mboersma
Contributor

@mboersma mboersma commented Jun 1, 2026

What type of PR is this?

/kind flake

What this PR does / why we need it:

The apiversion-upgrade job creates a self-hosted management cluster, upgrades all
providers on it with clusterctl upgrade apply, then scales workload clusters. CAPI's
ClusterctlUpgradeSpec hardcodes a single control-plane machine for that management
cluster, so its public API server load balancer has one backend. Under the load of the
provider upgrade, that single node can briefly become unreachable, which surfaces as
"provider not ready after 5m0s" and fails the (required) job. The job currently passes
less than 20% of the time.

This vendors ClusterctlUpgradeSpec into CAPZ with one addition, a
ManagementClusterControlPlaneMachineCount field, and runs the upgrade specs with a
3-node HA management control plane. The clone is a near-verbatim copy of CAPI v1.13.2;
the only functional change is marked "CAPZ ADDITION".
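
As a rough illustration of the shape of that addition (not the vendored code itself), the sketch below uses a hypothetical, trimmed-down input struct. Only the field name ManagementClusterControlPlaneMachineCount comes from this PR; the struct name, the helper, and the assumption that an unset value falls back to one machine are illustrative.

```go
package main

import "fmt"

// upgradeSpecInput is a hypothetical stand-in for the vendored spec's input
// struct. Only the field named in this PR is shown.
type upgradeSpecInput struct {
	// ManagementClusterControlPlaneMachineCount is the CAPZ ADDITION.
	// Assumption: nil keeps the previous single-machine behavior.
	ManagementClusterControlPlaneMachineCount *int64
}

// managementControlPlaneCount returns the requested control-plane machine
// count for the management cluster, defaulting to 1 when the field is unset.
func managementControlPlaneCount(in upgradeSpecInput) int64 {
	if in.ManagementClusterControlPlaneMachineCount != nil {
		return *in.ManagementClusterControlPlaneMachineCount
	}
	return 1
}

func main() {
	ha := int64(3)

	// Unset field: falls back to a single control-plane machine.
	fmt.Println(managementControlPlaneCount(upgradeSpecInput{}))

	// HA management cluster as described in this PR: three machines.
	fmt.Println(managementControlPlaneCount(upgradeSpecInput{
		ManagementClusterControlPlaneMachineCount: &ha,
	}))
}
```

A pointer field is used here so that callers who do not set it keep the old behavior. Whether the real field is a pointer or a plain integer is not stated in this description.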

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

See kubernetes-sigs/cluster-api#13766.

The intent is to prove out the fix here, then cherry-pick the
ManagementClusterControlPlaneMachineCount field upstream to cluster-api. Once it lands
upstream, test/e2e/clusterctl_upgrade.go should be deleted and the callers switched
back to capi_e2e.ClusterctlUpgradeSpec.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests
  • cherry-pick candidate

Release note:

Make the apiversion-upgrade management cluster HA

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/flake Categorizes issue or PR as related to a flaky test. labels Jun 1, 2026
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 1, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jackfrancis for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mboersma mboersma force-pushed the ha-upgrade-mgmt-cluster branch from c6ab190 to 0ac7656 Compare June 1, 2026 19:47
@mboersma
Contributor Author

mboersma commented Jun 1, 2026

/test pull-cluster-api-provider-azure-apiversion-upgrade

@kubernetes-sigs kubernetes-sigs deleted a comment from k8s-ci-robot Jun 1, 2026
@codecov

codecov Bot commented Jun 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 43.85%. Comparing base (f5bc974) to head (fc83052).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #6329   +/-   ##
=======================================
  Coverage   43.85%   43.85%           
=======================================
  Files         291      291           
  Lines       25344    25344           
=======================================
  Hits        11114    11114           
  Misses      13457    13457           
  Partials      773      773           


@mboersma
Contributor Author

mboersma commented Jun 1, 2026

@mboersma mboersma force-pushed the ha-upgrade-mgmt-cluster branch from 0ac7656 to fc83052 Compare June 1, 2026 21:47
@mboersma
Contributor Author

mboersma commented Jun 1, 2026

/test pull-cluster-api-provider-azure-apiversion-upgrade

@mboersma
Contributor Author

mboersma commented Jun 2, 2026

Passed again on the second try: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/6329/pull-cluster-api-provider-azure-apiversion-upgrade/2061568060594065408

Given that this job passes only ~20% of the time, I think this fix is working.
