OCPEDGE-1500: Added scaling strategies for TNF #1396

jaypoulz · 2025-02-19T22:14:58Z

This PR updates the scaling strategy options available in CEO to provide options that are compatible with TNF.
This PR depends on openshift/api#2196

openshift-ci-robot · 2025-02-19T22:15:04Z

@jaypoulz: This pull request references Jira Issue OCPBUGS-1500, which is invalid:

expected the bug to be open, but it isn't
expected the bug to target either version "4.19." or "openshift-4.19.", but it targets "4.11.z" instead
expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is Closed (Won't Do) instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot · 2025-02-19T22:20:16Z

@jaypoulz: This pull request references OCPEDGE-1500 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

This PR updates the scaling strategy options available in CEO to provide options that are compatible with TNF.
This PR depends on openshift/api#2196

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

jaypoulz · 2025-02-19T22:23:35Z

/jira refresh

openshift-ci-robot · 2025-02-19T22:23:40Z

@jaypoulz: This pull request references OCPEDGE-1500 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

jaypoulz · 2025-02-21T01:27:39Z

/retest-required

jaypoulz · 2025-02-25T14:45:43Z

/retest-required

jaypoulz · 2025-02-26T13:43:11Z

/retest-required

jaypoulz · 2025-02-26T16:30:11Z

/retest

jaypoulz · 2025-02-26T22:23:19Z

/retest

jaypoulz · 2025-02-27T13:32:40Z

/retest

jaypoulz · 2025-02-28T00:15:14Z

/retest-required

jaypoulz · 2025-02-28T19:09:21Z

/retest

jaypoulz · 2025-03-03T12:36:54Z

/restest-required

jaypoulz · 2025-03-03T19:05:40Z

/retest-required

jaypoulz · 2025-03-05T00:47:58Z

So the e2e tests are real failures - not connected to the CEO code, but rather the API bump.
I don't think the changes I specifically rely on in the API are relevant to the broken tests, but I think we'll need a holistic review of the API updates if we're going to get this merged.

dusk125 · 2025-03-05T16:08:47Z

@jaypoulz now that the API has merged, can you update go.mod to consume it (rather than from Egli's repo)? I can start reviewing and debugging from there.

jaypoulz · 2025-03-05T16:19:46Z

Here's the updated API :)
I've been debugging locally, and I discovered that part of my issue was that my cluster didn't have the backups featuregate enabled. Trying to work through the e2e tests one at a time. :)

jaypoulz · 2025-03-05T19:31:43Z

After enabling the featuregates, I've been able to get all of the e2e tests to pass locally.
I'm not convinced that the operator test failures are real. The test run for ci/prow/e2e-aws-ovn-etcd-scaling hit unrelated issues.

jaypoulz · 2025-03-05T21:26:44Z

/retest-required

jaypoulz · 2025-03-05T21:28:00Z

Based on my local testing, I'm confident there isn't anything in my code that is affecting the CI jobs. Sending through another round of retests because I'm confident they'll have a good chance of passing.

dusk125 · 2025-03-06T15:19:53Z

/retest-required

dusk125 · 2025-03-06T14:59:05Z

pkg/cmd/render/render.go

+	// The only code that references the scaling strategy written to the
+	// manifests lives in the 00_etcd-endpoints-cm.yaml and etcd-member-pod.yaml
+	// templates, which just check for BootstrapInPlaceStrategy (see logic above).
+	case cpReplicaCount == 1:


There might be a case here where the delayed HA strategy is no longer returned properly: previously it was returned immediately, whereas here one of the other paths can execute first (for non-2NO clusters).

There is definitely some risk introduced by this.
The test cases that should cover this are:
[new] Testing DelayedHA with Arbiter, 2 CP nodes
[old] Delayed HA with Standard HA, 3 CP nodes

The only other relevant case would be the delayed annotation with 1 CP node, which I don't think we have a test case for. What is the expected response for that? In my mind, single node should always respond with BootstrapInPlace if that's set or UnsafeScalingStrategy if not, but the old logic would short-circuit and respond with DelayedHA.

Anything with more than 3 CP nodes should behave the same as the [old] DelayedHA test.

We could alleviate the risk by moving the ==1 case to after the delayed case since the other cases are specifically for 2NO.
That should result in the same default behavior
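To make the ordering concern concrete, here is a minimal sketch, not the actual render.go code. The strategy names come from this thread, but the function names, parameters, and string values are hypothetical, BootstrapInPlace is assumed to be checked first, and the 2NO-specific cases are left out. It compares a switch that checks the single-node case before the delayed annotation with one that checks it after.

```go
package main

import "fmt"

// ScalingStrategy and the constant names follow the strategies named in this
// thread; the string values are illustrative assumptions.
type ScalingStrategy string

const (
	HAScalingStrategy        ScalingStrategy = "HAScalingStrategy"
	DelayedHAScalingStrategy ScalingStrategy = "DelayedHAScalingStrategy"
	BootstrapInPlaceStrategy ScalingStrategy = "BootstrapInPlaceStrategy"
	UnsafeScalingStrategy    ScalingStrategy = "UnsafeScalingStrategy"
)

// singleNodeFirst (hypothetical) checks cpReplicaCount == 1 before the
// delayed HA annotation, so a single-node cluster never reports DelayedHA.
func singleNodeFirst(bootstrapInPlace, delayedHA bool, cpReplicaCount int) ScalingStrategy {
	switch {
	case bootstrapInPlace:
		return BootstrapInPlaceStrategy
	case cpReplicaCount == 1:
		return UnsafeScalingStrategy
	case delayedHA:
		return DelayedHAScalingStrategy
	default:
		return HAScalingStrategy
	}
}

// delayedFirst (hypothetical) checks the delayed HA annotation before the
// single-node case, matching the old short-circuit behavior described above.
func delayedFirst(bootstrapInPlace, delayedHA bool, cpReplicaCount int) ScalingStrategy {
	switch {
	case bootstrapInPlace:
		return BootstrapInPlaceStrategy
	case delayedHA:
		return DelayedHAScalingStrategy
	case cpReplicaCount == 1:
		return UnsafeScalingStrategy
	default:
		return HAScalingStrategy
	}
}

func main() {
	// Compare both orderings with the delayed annotation set.
	for _, cp := range []int{1, 3, 5} {
		fmt.Printf("delayedHA=true cp=%d: singleNodeFirst=%s delayedFirst=%s\n",
			cp, singleNodeFirst(false, true, cp), delayedFirst(false, true, cp))
	}
}
```

In this simplified model the two orderings only disagree for the delayed annotation with 1 CP node, which is the case without a test.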

Another important note is the comment I've made above this block. I don't think the scaling strategy specified by the render function is actually used anywhere (besides the detection of the bootstrap scaling strategy).

The logic using the scaling strategy (e.g. CheckSafeToScaleCluster in bootstrap.go and sync in bootstrap_teardown_controller.go) calls the GetBootstrapScalingStrategy function in ceohelpers/bootstrap.go.
If you look at that file, you'll notice that BootstrapInPlaceStrategy isn't even an option that this function returns.

The render.go scaling strategy function is only used by render.go => Run() => newTemplateData(r)
This template data is only referenced when creating manifests - and specifically, it's only there for injecting the bootstrap IP:

00_etcd-endpoints-cm.yaml

etcd-member-pod.yaml

IMHO, #547 would have been better off not introducing the exception for BootstrapInPlace as a scaling strategy since it's not solving the same problems as the others.

The simpler way would have been to expose a BootstrapMode to the templateData, which is only set to InPlace when the annotation is set. Then you could drop all logic related to the ScalingStrategy from render.go.
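A minimal sketch of that idea, under loud assumptions: the BootstrapMode type, its values, the InPlace helper, the field names, and the one-line template are all hypothetical and do not reflect the real templateData or the real manifest templates. It only shows how a template could decide whether to inject the bootstrap IP without consulting a scaling strategy.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// BootstrapMode is a hypothetical type; it is not part of the existing code.
type BootstrapMode string

const (
	BootstrapModeDefault BootstrapMode = ""
	BootstrapModeInPlace BootstrapMode = "InPlace"
)

// templateData here is a stand-in with only the fields this sketch needs.
type templateData struct {
	BootstrapMode BootstrapMode
	BootstrapIP   string
}

// InPlace reports whether the bootstrap-in-place annotation was set.
func (d templateData) InPlace() bool {
	return d.BootstrapMode == BootstrapModeInPlace
}

// Illustrative template: emit the bootstrap IP unless bootstrapping in place.
const endpointsTemplate = `{{if not .InPlace}}bootstrapIP: {{.BootstrapIP}}{{end}}`

// render executes the illustrative template against the given data.
func render(data templateData) (string, error) {
	tmpl, err := template.New("endpoints").Parse(endpointsTemplate)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	for _, mode := range []BootstrapMode{BootstrapModeDefault, BootstrapModeInPlace} {
		out, err := render(templateData{BootstrapMode: mode, BootstrapIP: "10.0.0.5"})
		if err != nil {
			panic(err)
		}
		fmt.Printf("mode=%q rendered=%q\n", mode, out)
	}
}
```

With a field like this, the templates would not need to know about any scaling strategy at all.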

Pushed a new version that incorporates the suggestion in #1396 (comment).

…ncing (TNF)
- TwoNodeScalingStrategy is for core installer installations of Two Node OpenShift with Fencing
- DelayedTwoNodeScalingStrategy is for assisted installs of Two Node OpenShift with Fencing

Full change list:
- Updated bootstrap scaling logic for TNF
- Updated manifests render function to know about new topologies
- Added a comment to health.go to explain why there isn't a TNF override directly in the quorum fault tolerance check function
- Updated bootstrap teardown logic to handle TNF cases
- Fixed quorum_check_test to check for the updated error message
- Removed the single node topology check utility function
- Enabled defragcontroller for Two Node OpenShift with Fencing and Two Node OpenShift with Arbiter

dusk125 · 2025-03-06T20:07:48Z

/lgtm

openshift-ci · 2025-03-06T20:08:31Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dusk125, jaypoulz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [dusk125]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2025-03-06T22:02:32Z

/retest-required

Remaining retests: 0 against base HEAD 5bbe494 and 2 for PR HEAD c071422 in total

openshift-ci-robot · 2025-03-07T16:06:21Z

/retest-required

Remaining retests: 0 against base HEAD 5bbe494 and 2 for PR HEAD c071422 in total

openshift-ci · 2025-03-07T18:49:58Z

@jaypoulz: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-disruptive | `c071422` | link | false | `/test e2e-aws-disruptive` |
| ci/prow/e2e-metal-ovn-ha-cert-rotation-shutdown | `c071422` | link | false | `/test e2e-metal-ovn-ha-cert-rotation-shutdown` |
| ci/prow/e2e-gcp-ovn-etcd-scaling | `c071422` | link | false | `/test e2e-gcp-ovn-etcd-scaling` |
| ci/prow/e2e-gcp-disruptive | `c071422` | link | false | `/test e2e-gcp-disruptive` |
| ci/prow/e2e-aws-etcd-certrotation | `c071422` | link | false | `/test e2e-aws-etcd-certrotation` |
| ci/prow/e2e-aws-etcd-recovery | `c071422` | link | false | `/test e2e-aws-etcd-recovery` |
| ci/prow/configmap-scale | `c071422` | link | false | `/test configmap-scale` |
| ci/prow/e2e-gcp-disruptive-ovn | `c071422` | link | false | `/test e2e-gcp-disruptive-ovn` |
| ci/prow/e2e-vsphere-ovn-etcd-scaling | `c071422` | link | false | `/test e2e-vsphere-ovn-etcd-scaling` |
| ci/prow/e2e-azure-ovn-etcd-scaling | `c071422` | link | false | `/test e2e-azure-ovn-etcd-scaling` |
| ci/prow/e2e-azure | `c071422` | link | false | `/test e2e-azure` |
| ci/prow/e2e-metal-ovn-sno-cert-rotation-shutdown | `c071422` | link | false | `/test e2e-metal-ovn-sno-cert-rotation-shutdown` |
| ci/prow/e2e-aws-disruptive-ovn | `c071422` | link | false | `/test e2e-aws-disruptive-ovn` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-bot · 2025-03-07T23:08:48Z

[ART PR BUILD NOTIFIER]

Distgit: cluster-etcd-operator
This PR has been included in build cluster-etcd-operator-container-v4.19.0-202503072239.p0.gba5c2e8.assembly.stream.el9.
All builds following this will include this PR.

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 19, 2025

openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels Feb 19, 2025

openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Feb 19, 2025

openshift-ci bot requested review from Elbehery and tjungblu February 19, 2025 22:15

jaypoulz changed the title ~~WIP: OCPBUGS-1500: Added scaling strategies for TNF~~ WIP: OCPEDGE-1500: Added scaling strategies for TNF Feb 19, 2025

openshift-ci-robot removed jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Feb 19, 2025

jaypoulz force-pushed the OCPBUGS-1500-tns-scaling-strategies branch 2 times, most recently from f6bf5d8 to 9cc6039 Compare February 20, 2025 22:08

jaypoulz force-pushed the OCPBUGS-1500-tns-scaling-strategies branch from 9cc6039 to e489d1f Compare February 24, 2025 17:04

jaypoulz changed the title ~~WIP: OCPEDGE-1500: Added scaling strategies for TNF~~ OCPEDGE-1500: Added scaling strategies for TNF Feb 24, 2025

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 24, 2025

jaypoulz force-pushed the OCPBUGS-1500-tns-scaling-strategies branch 3 times, most recently from 584a5fc to 746a469 Compare February 25, 2025 20:59

jaypoulz force-pushed the OCPBUGS-1500-tns-scaling-strategies branch from 746a469 to 706ad9b Compare March 5, 2025 16:18

dusk125 reviewed Mar 6, 2025

View reviewed changes

jaypoulz force-pushed the OCPBUGS-1500-tns-scaling-strategies branch from 706ad9b to c071422 Compare March 6, 2025 19:09

openshift-ci bot assigned dusk125 Mar 6, 2025

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 6, 2025

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 6, 2025

openshift-merge-bot bot merged commit ba5c2e8 into openshift:main Mar 7, 2025
20 of 33 checks passed
