jaypoulz
Contributor

@jaypoulz jaypoulz commented Feb 19, 2025

This PR updates the scaling strategy options available in the cluster-etcd-operator (CEO) to provide options that are compatible with Two Node OpenShift with Fencing (TNF).
This PR depends on openshift/api#2196

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label Feb 19, 2025
@openshift-ci-robot openshift-ci-robot added the jira/severity-important and jira/valid-reference labels Feb 19, 2025
@openshift-ci-robot

@jaypoulz: This pull request references Jira Issue OCPBUGS-1500, which is invalid:

  • expected the bug to be open, but it isn't
  • expected the bug to target either version "4.19." or "openshift-4.19.", but it targets "4.11.z" instead
  • expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is Closed (Won't Do) instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug label Feb 19, 2025
@openshift-ci openshift-ci bot requested review from Elbehery and tjungblu February 19, 2025 22:15
@jaypoulz jaypoulz changed the title from "WIP: OCPBUGS-1500: Added scaling strategies for TNF" to "WIP: OCPEDGE-1500: Added scaling strategies for TNF" Feb 19, 2025
@openshift-ci-robot openshift-ci-robot removed the jira/severity-important and jira/invalid-bug labels Feb 19, 2025
@openshift-ci-robot

openshift-ci-robot commented Feb 19, 2025

@jaypoulz: This pull request references OCPEDGE-1500 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

@jaypoulz
Contributor Author

/jira refresh

@openshift-ci-robot

openshift-ci-robot commented Feb 19, 2025

@jaypoulz: This pull request references OCPEDGE-1500 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

@jaypoulz jaypoulz force-pushed the OCPBUGS-1500-tns-scaling-strategies branch 2 times, most recently from f6bf5d8 to 9cc6039 Compare February 20, 2025 22:08
@jaypoulz
Contributor Author

/retest-required

@jaypoulz jaypoulz force-pushed the OCPBUGS-1500-tns-scaling-strategies branch from 9cc6039 to e489d1f Compare February 24, 2025 17:04
@jaypoulz jaypoulz changed the title from "WIP: OCPEDGE-1500: Added scaling strategies for TNF" to "OCPEDGE-1500: Added scaling strategies for TNF" Feb 24, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress label Feb 24, 2025
@jaypoulz
Contributor Author

/retest-required

@jaypoulz jaypoulz force-pushed the OCPBUGS-1500-tns-scaling-strategies branch 3 times, most recently from 584a5fc to 746a469 Compare February 25, 2025 20:59
@jaypoulz
Contributor Author

/retest-required

@jaypoulz
Contributor Author

/retest

2 similar comments
@jaypoulz
Contributor Author

/retest

@jaypoulz
Contributor Author

/retest

@jaypoulz
Contributor Author

/retest-required

@jaypoulz
Contributor Author

/retest

@jaypoulz
Contributor Author

jaypoulz commented Mar 3, 2025

/restest-required

@jaypoulz
Contributor Author

jaypoulz commented Mar 3, 2025

/retest-required

@jaypoulz
Contributor Author

jaypoulz commented Mar 5, 2025

So the e2e tests are real failures - not connected to the CEO code, but rather the API bump.
I don't think the changes I specifically rely on in the API are relevant to the broken tests, but I think we'll need a holistic review of the API updates if we're going to get this merged.

@dusk125
Contributor

dusk125 commented Mar 5, 2025

@jaypoulz now that the API has merged, can you update the go.mod to consume it (not from Egli's repo)? I can start reviewing and debugging from there.

@jaypoulz jaypoulz force-pushed the OCPBUGS-1500-tns-scaling-strategies branch from 746a469 to 706ad9b Compare March 5, 2025 16:18
@jaypoulz
Contributor Author

jaypoulz commented Mar 5, 2025

Here's the updated API :)
I've been debugging locally, and I discovered that part of my issue was that my cluster didn't have the backups featuregate enabled. Trying to work through the e2e tests one at a time. :)

@jaypoulz
Contributor Author

jaypoulz commented Mar 5, 2025

After enabling the featuregates, I've been able to get all of the e2e tests to pass locally.
I'm not convinced that the operator test failures are real. The test run for ci/prow/e2e-aws-ovn-etcd-scaling hit unrelated issues.

@jaypoulz
Contributor Author

jaypoulz commented Mar 5, 2025

/retest-required

@jaypoulz
Contributor Author

jaypoulz commented Mar 5, 2025

Based on my local testing, I'm confident there isn't anything in my code that is affecting the CI jobs. Sending through another round of retests because I think they have a good chance of passing.

@dusk125
Contributor

dusk125 commented Mar 6, 2025

/retest-required

// The only code that references the scaling strategy written to the
// manifests lives in the 00_etcd-endpoints-cm.yaml and etcd-member-pod.yaml
// templates, which just check for BootstrapInPlaceStrategy (see logic above).
case cpReplicaCount == 1:
Contributor

There might be a case here where the delayed HA strategy is not returned properly. Previously it was returned immediately, whereas here it's possible for one of the other paths to execute first (for non-2NO clusters).

Contributor Author

There is definitely some risk introduced by this.
The test cases that should cover this are:
[new] Testing DelayedHA with Arbiter, 2 CP nodes
[old] Delayed HA with Standard HA, 3 CP nodes

The only other relevant case would be the delayed annotation with 1 CP node, which I don't think we have a test case for. What is the expected response for that? In my mind, single node should always respond with BootstrapInPlace if that's set, or UnsafeScalingStrategy if not, but the old logic would short-circuit and respond with DelayedHA.

Anything with more than 3 CP nodes should be the same as the [old] DelayedHA test.

Contributor

We could alleviate the risk by moving the ==1 case to after the delayed case since the other cases are specifically for 2NO.
That should result in the same default behavior
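
A minimal sketch of that ordering, to make the discussion concrete. It is not the code in render.go: the `clusterInputs` struct, the `twoNodeFencing` flag, the `pickScalingStrategy` function, the `HAScalingStrategy`/`DelayedHAScalingStrategy` spellings, and the mapping of "2NO plus delayed" to DelayedTwoNodeScalingStrategy are all assumptions made for illustration.

```go
package main

import "fmt"

// Strategy names follow the ones mentioned in this thread where possible;
// the HA and DelayedHA spellings are assumed.
type scalingStrategy string

const (
	haScalingStrategy             scalingStrategy = "HAScalingStrategy"
	delayedHAScalingStrategy      scalingStrategy = "DelayedHAScalingStrategy"
	twoNodeScalingStrategy        scalingStrategy = "TwoNodeScalingStrategy"
	delayedTwoNodeScalingStrategy scalingStrategy = "DelayedTwoNodeScalingStrategy"
	bootstrapInPlaceStrategy      scalingStrategy = "BootstrapInPlaceStrategy"
	unsafeScalingStrategy         scalingStrategy = "UnsafeScalingStrategy"
)

// clusterInputs is a hypothetical, simplified view of what the render step knows.
type clusterInputs struct {
	cpReplicaCount   int
	twoNodeFencing   bool // hypothetical flag: 2NO-with-fencing topology
	delayedHA        bool // delayed HA marker/annotation is present
	bootstrapInPlace bool // bootstrap-in-place was requested
}

// pickScalingStrategy checks the 2NO-specific cases first, then the delayed
// case, and only then the single-replica case, so a delayed cluster that is
// not 2NO still gets DelayedHA regardless of replica count.
func pickScalingStrategy(in clusterInputs) scalingStrategy {
	switch {
	case in.twoNodeFencing && in.delayedHA:
		return delayedTwoNodeScalingStrategy
	case in.twoNodeFencing:
		return twoNodeScalingStrategy
	case in.delayedHA:
		return delayedHAScalingStrategy
	case in.cpReplicaCount == 1:
		if in.bootstrapInPlace {
			return bootstrapInPlaceStrategy
		}
		return unsafeScalingStrategy
	default:
		return haScalingStrategy
	}
}

func main() {
	fmt.Println(pickScalingStrategy(clusterInputs{cpReplicaCount: 1, delayedHA: true}))
	fmt.Println(pickScalingStrategy(clusterInputs{cpReplicaCount: 2, twoNodeFencing: true}))
	fmt.Println(pickScalingStrategy(clusterInputs{cpReplicaCount: 3}))
}
```

With this order, a 1 CP node cluster carrying the delayed annotation still resolves to DelayedHA, which matches the old short-circuit behavior rather than the "single node always wins" alternative raised above.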

Contributor Author

Another important note is the comment I've made above this block. I don't think the scaling strategy specified by the render function is actually used anywhere (besides the detection of the bootstrap scaling strategy).

The logic using the scaling strategy (e.g. CheckSafeToScaleCluster in bootstrap.go and sync in bootstrap_teardown_controller.go) calls the GetBootstrapScalingStrategy function in ceohelpers/bootstrap.go.
If you look at that file, you'll notice that BootstrapInPlaceStrategy isn't even an option that this function returns.

The render.go scaling strategy function is only used by render.go => Run() => newTemplateData(r)
This template data is only referenced when creating manifests - and specifically, it's only there for injecting the bootstrap IP:

Contributor Author

IMHO, #547 would have been better off not introducing the exception for BootstrapInPlace as a scaling strategy since it's not solving the same problems as the others.

The simpler way would have been to expose a BootstrapMode to the templateData, which is only set to InPlace when the annotation is set. Then you could drop all logic related to the ScalingStrategy from render.go.
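
A rough sketch of what that could look like, using Go's standard text/template package. Everything here is hypothetical: the `templateData` fields, the `newTemplateData` signature, the "Standard" mode name, and the one-line template are illustrative stand-ins, not the operator's actual types or manifests. It assumes the only thing the template needs to decide is whether to inject the bootstrap IP.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// Hypothetical bootstrap modes; only "InPlace" comes from the comment above.
const (
	bootstrapModeStandard = "Standard"
	bootstrapModeInPlace  = "InPlace"
)

// templateData is an illustrative stand-in for the data handed to the
// manifest templates. It carries no scaling strategy at all.
type templateData struct {
	BootstrapMode string
	BootstrapIP   string
}

// newTemplateData sets InPlace only when the bootstrap-in-place annotation is set.
func newTemplateData(inPlaceAnnotationSet bool, bootstrapIP string) templateData {
	mode := bootstrapModeStandard
	if inPlaceAnnotationSet {
		mode = bootstrapModeInPlace
	}
	return templateData{BootstrapMode: mode, BootstrapIP: bootstrapIP}
}

// endpointsSnippet is a made-up one-line template: the bootstrap IP is
// injected unless the cluster is bootstrapping in place.
const endpointsSnippet = `{{if ne .BootstrapMode "InPlace"}}etcd-bootstrap: {{.BootstrapIP}}{{end}}`

// renderSnippet executes the template against the given data.
func renderSnippet(data templateData) (string, error) {
	tmpl, err := template.New("endpoints").Parse(endpointsSnippet)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	for _, inPlace := range []bool{false, true} {
		out, err := renderSnippet(newTemplateData(inPlace, "192.0.2.10"))
		if err != nil {
			panic(err)
		}
		fmt.Printf("inPlace=%v -> %q\n", inPlace, out)
	}
}
```

The point of the sketch is that the template branches on a dedicated mode value, so render.go would not need to know about any scaling strategy.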

Contributor Author

Pushed a new version that incorporates the suggestion in #1396 (comment).

…ncing (TNF)

- TwoNodeScalingStrategy is for core installer installations of Two Node OpenShift with Fencing
- DelayedTwoNodeScalingStrategy is for assisted installs of Two Node OpenShift with Fencing

Full change list:
- Updated the bootstrap scaling logic for TNF
- Updated manifests render function to know about new topologies
- Added a comment to health.go to explain why there isn't a TNF override directly in the quorum fault tolerance check function
- Updated bootstrap teardown logic to handle TNF cases
- Fixed quorum_check_test to check for the updated error message
- Removed the single node topology check utility function
- Enabled defragcontroller for Two Node OpenShift with Fencing and Two Node OpenShift with Arbiter
@jaypoulz jaypoulz force-pushed the OCPBUGS-1500-tns-scaling-strategies branch from 706ad9b to c071422 Compare March 6, 2025 19:09
@dusk125
Contributor

dusk125 commented Mar 6, 2025

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Mar 6, 2025
Contributor

openshift-ci bot commented Mar 6, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dusk125, jaypoulz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Mar 6, 2025
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 5bbe494 and 2 for PR HEAD c071422 in total

1 similar comment
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 5bbe494 and 2 for PR HEAD c071422 in total

Contributor

openshift-ci bot commented Mar 7, 2025

@jaypoulz: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                                         Commit   Details  Required  Rerun command
ci/prow/e2e-aws-disruptive                        c071422  link     false     /test e2e-aws-disruptive
ci/prow/e2e-metal-ovn-ha-cert-rotation-shutdown   c071422  link     false     /test e2e-metal-ovn-ha-cert-rotation-shutdown
ci/prow/e2e-gcp-ovn-etcd-scaling                  c071422  link     false     /test e2e-gcp-ovn-etcd-scaling
ci/prow/e2e-gcp-disruptive                        c071422  link     false     /test e2e-gcp-disruptive
ci/prow/e2e-aws-etcd-certrotation                 c071422  link     false     /test e2e-aws-etcd-certrotation
ci/prow/e2e-aws-etcd-recovery                     c071422  link     false     /test e2e-aws-etcd-recovery
ci/prow/configmap-scale                           c071422  link     false     /test configmap-scale
ci/prow/e2e-gcp-disruptive-ovn                    c071422  link     false     /test e2e-gcp-disruptive-ovn
ci/prow/e2e-vsphere-ovn-etcd-scaling              c071422  link     false     /test e2e-vsphere-ovn-etcd-scaling
ci/prow/e2e-azure-ovn-etcd-scaling                c071422  link     false     /test e2e-azure-ovn-etcd-scaling
ci/prow/e2e-azure                                 c071422  link     false     /test e2e-azure
ci/prow/e2e-metal-ovn-sno-cert-rotation-shutdown  c071422  link     false     /test e2e-metal-ovn-sno-cert-rotation-shutdown
ci/prow/e2e-aws-disruptive-ovn                    c071422  link     false     /test e2e-aws-disruptive-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit ba5c2e8 into openshift:main Mar 7, 2025
20 of 33 checks passed
@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: cluster-etcd-operator
This PR has been included in build cluster-etcd-operator-container-v4.19.0-202503072239.p0.gba5c2e8.assembly.stream.el9.
All builds following this will include this PR.

Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
lgtm: Indicates that a PR is ready to be merged.
4 participants