-
Notifications
You must be signed in to change notification settings - Fork 143
OCPEDGE-1500: Added scaling strategies for TNF #1396
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPEDGE-1500: Added scaling strategies for TNF #1396
Conversation
@jaypoulz: This pull request references Jira Issue OCPBUGS-1500, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
@jaypoulz: This pull request references OCPEDGE-1500 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
/jira refresh |
@jaypoulz: This pull request references OCPEDGE-1500 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
f6bf5d8
to
9cc6039
Compare
/retest-required |
9cc6039
to
e489d1f
Compare
/retest-required |
584a5fc
to
746a469
Compare
/retest-required |
/retest |
2 similar comments
/retest |
/retest |
/retest-required |
/retest |
/restest-required |
/retest-required |
So the e2e tests are real failures - not connected to the CEO code, but rather the API bump. |
@jaypoulz now that the API has merged, can you update the go.mod to consume it (not from Egli's repo?) I can start reviewing and debugging from there. |
746a469
to
706ad9b
Compare
Here's the updated API :) |
After enabling the featuregates, I've been able to get all of the e2e tests to pass locally. |
/retest-required |
Based on my local testing, I'm confident there isn't anything in my code that is affecting the CI jobs. Sounding through another round of retests because I'm confident they'll have a good chance at passing. |
/retest-required |
pkg/cmd/render/render.go
Outdated
// The only code that references the scaling strategy written to the | ||
// manifests lives in the 00_etcd-endpoints-cm.yaml and etcd-member-pod.yaml | ||
// templates, which just check for BootstrapInPlaceStrategy (see logic above). | ||
case cpReplicaCount == 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There might be a case here where the delayed HA strategy is not being returned properly as previously it was returned immediately. Where here it's possible for one of the other paths to execute first (for non-2no clusters)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is definitely some risk introduced by this.
The test cases that should cover this are:
[new] Testing DelayedHA with Arbiter, 2 CP nodes
[old] Delayed HA with Standard HA, 3 CP nodes
The only other relevant cases would be delayed annotation with 1 CP node, which I don't think we have a test case for. What is the expected response for that? In my mind, single node should always respond BootstrapInPlace place if that's set or UnsafeScalingStrategy if not, but the old logic would short circuit and respond with DelayedHA.
Anything with more than 3 CP nodes should be the same as the the [old] DelayedHA test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could alleviate the risk by moving the ==1 case to after the delayed case since the other cases are specifically for 2NO.
That should result in the same default behavior
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another important node is the comment I've made above this block. I don't think the scaling strategy specified by the render function is actually used anywhere (besides the detection of the bootstrap scaling strategy).
The logic using the scaling strategy (e.g. CheckSafeToScaleCluster in bootstrap.go and sync in bootstrap_teardown_controller.go), call the GetBootstrapScalingStrategy function in ceohelpers/bootstrap.go.
If you look at that file, you'll notice that BootstrapScalingStrategy isn't even an option that this function returns.
The render.go scaling strategy function is only used by render.go => Run() => newTemplateData(r)
This template data is only referenced when creating manifests - and specifically, it's only there for injecting the bootstrap IP:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO, #547 would have been better off not introducing the exception for BootstrapInPlace as a scaling strategy since it's not solving the same problems as the others.
The simpler way would have been to expose a BootstrapMode to the templateData, which is only set to InPlace when the annotation is set. Then you could drop all logic related to the ScalingStrategy from render.go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushed a new version that incorporates the suggestion in #1396 (comment).
…ncing (TNF) - TwoNodeScalingStrategy is for core installer installations fo Two Node OpenShift with Fencing - DelayedTwoNodeScalingStrategy is for assisted installs of Two Node OpenShift with Fencing Full change list: - Added updated with bootstrap scaling logic for TNF - Updated manifests render function to know about new topologies - Added a comment to health.go to explain why there isn't a TNF override directly in the quorum fault tolerance check function - Updated bootstrap teardown logic to handle TNF cases - Fixed quorum_check_test to check for the updated error message - Removed the single node topology check utility function - Enabled defragcontroller for Two Node OpenShift with Fencing and Two Node OpenShift with Arbiter
706ad9b
to
c071422
Compare
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dusk125, jaypoulz The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
1 similar comment
@jaypoulz: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
ba5c2e8
into
openshift:main
[ART PR BUILD NOTIFIER] Distgit: cluster-etcd-operator |
This PR updates the scaling strategy options available in CEO to provide options that are compatible with TNF.
This PR depends on openshift/api#2196