OCPBUGS-61063: test/extended/cli/adm_upgrade/recommend: Enable precheck and accept #30113
base: main
Conversation
Job Failure Risk Analysis for sha: fcebaa6
Force-pushed ae9e67d to b18aac0.
Job Failure Risk Analysis for sha: b18aac0
Risk analysis has seen new tests most likely introduced by this PR. New Test Risks for sha: b18aac0
New tests seen in this PR at sha: b18aac0
Force-pushed 34ec4d1 to 2a74ae2.
Job Failure Risk Analysis for sha: 2a74ae2
Force-pushed 2a74ae2 to 8924956.
Job Failure Risk Analysis for sha: 8924956
Force-pushed 931699e to d7501ee.
Job Failure Risk Analysis for sha: d7501ee
Risk analysis has seen new tests most likely introduced by this PR. New Test Risks for sha: d7501ee
New tests seen in this PR at sha: d7501ee
Force-pushed 3d796c1 to 6f8ca39.
Job Failure Risk Analysis for sha: 6f8ca39
Force-pushed 6f8ca39 to 4cbd5fe.
Job Failure Risk Analysis for sha: 4cbd5fe
Risk analysis has seen new tests most likely introduced by this PR. New Test Risks for sha: 4cbd5fe
New tests seen in this PR at sha: 4cbd5fe
Force-pushed 4cbd5fe to b9b2c30.
Job Failure Risk Analysis for sha: b9b2c30
Risk analysis has seen new tests most likely introduced by this PR. New Test Risks for sha: b9b2c30
New tests seen in this PR at sha: b9b2c30
Force-pushed b9b2c30 to ccb493b.
Job Failure Risk Analysis for sha: ccb493b
Force-pushed ccb493b to eefe2bb.
Job Failure Risk Analysis for sha: eefe2bb
Showing 20 of 40 job analyses.
Force-pushed eefe2bb to 93f5c58.
Risk analysis has seen new tests most likely introduced by this PR. New Test Risks for sha: 93f5c58
New tests seen in this PR at sha: 93f5c58
/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial 4
@wking: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e0a1da90-89cd-11f0-8eef-a92312a06c06-0
Continuing to run additional serial jobs, to see how reliable the new code is. Sticking with the aggregate-of-four batching to avoid swamping CI-infra capacity, but also not waiting for the previous round of payload runs to wrap up, as I try to balance speedy sample-size growth against infra load:
/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-serial 4
@wking: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c4dd0710-89de-11f0-944e-baf715a5082d-0
/retest-required
…False

In 44cd78a (test/extended/cli/adm_upgrade/recommend: Account for Upgradeable=False, 2025-08-05, openshift#30113), I'd updated the regular expression to accept:

Reason: MultipleReasons

for clusters that had both Upgradeable=False and conditional risks going on. In 7724a75 (test/extended/cli/adm_upgrade/recommend: Trust the ingress CA, 2025-08-14, openshift#30113), I extended the regular expression to cover:

Reason: accepted TestRiskA via ConditionalUpdateRisk

But I hadn't thought through the MultipleReasons case, and this commit catches us up to tech-preview serial output like [1]:

...
Reason: accepted MultipleReasons via ConditionalUpdateRisk
Message: Cluster operator config-operator should not be upgraded between minor versions: FeatureGatesUpgradeable: "TechPreviewNoUpgrade" does not allow updates This is a test risk. https://example.com/testRiskA
...

[1]: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/30113/pull-ci-openshift-origin-main-e2e-gcp-ovn-techpreview-serial-1of2/1963672599955247104
Job Failure Risk Analysis for sha: c68be5e
LGTM, leaving a note but not a blocker
defaultIngressSecretName, err := oc.Run("get").Args("--namespace=openshift-ingress-operator", "-o", "jsonpath={.spec.defaultCertificate.name}", "ingresscontroller.operator.openshift.io", "default").Output()
if err != nil {
	return "", err
}

if defaultIngressSecretName == "" {
	defaultIngressSecretName = "router-certs-default"
}

ingressNamespace := "openshift-ingress"
defaultIngressCert, err := oc.Run("extract").Args("--namespace", ingressNamespace, fmt.Sprintf("secret/%s", defaultIngressSecretName), "--keys=tls.crt", "--to=-").Output()
if err != nil {
	return "", err
}
Hehe, given my push for "let's stop using oc to test CVO" in threads like this, I think I'll need to voice it here as well; do these two need to be oc calls? I think that at least the first get would be more appropriate to do through client-go, getting a typed struct instead of reading elements with jsonpath or --keys.
They don't need to be oc, but given that we're in the test/extended/cli section, using oc seemed ok. I'm not strongly opinionated about this setup though; they could certainly be ported to the structured client-go if they were going to be used outside of oc CLI testing.
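For reference, the typed alternative would hand back the IngressController object and let Go apply the fallback directly. A minimal standalone sketch of that fallback logic, with the struct simply mirroring corev1.LocalObjectReference and the helper name hypothetical:

```go
package main

import "fmt"

// localObjectRef mirrors corev1.LocalObjectReference for this standalone sketch.
type localObjectRef struct {
	Name string
}

// defaultIngressSecretName is a hypothetical helper: given the IngressController's
// spec.defaultCertificate reference, return the secret to extract tls.crt from,
// falling back to the operator-managed default when none is set.
func defaultIngressSecretName(defaultCertificate *localObjectRef) string {
	if defaultCertificate == nil || defaultCertificate.Name == "" {
		return "router-certs-default"
	}
	return defaultCertificate.Name
}

func main() {
	fmt.Println(defaultIngressSecretName(nil))                                 // prints router-certs-default
	fmt.Println(defaultIngressSecretName(&localObjectRef{Name: "my-custom"})) // prints my-custom
}
```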
A few serial jobs failed, but all on either infra or other test-cases, not on anything related to my changes.
/retest-required
Both /verified by examining all presubmits and /payload jobs, no failures identified for these tests
@wking: This PR has been marked as verified by
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/lgtm
/hold for @petr-muller to check the reply above.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: hongkailiu, petr-muller, wking The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Job Failure Risk Analysis for sha: c68be5e
out, err := oc.Run("--certificate-authority", caBundleFilePath, "adm", "upgrade", "recommend", "--version", fmt.Sprintf("4.%d.0", currentVersion.Minor+1), "--accept", "ConditionalUpdateRisk").EnvVar("OC_ENABLE_CMD_UPGRADE_RECOMMEND", "true").EnvVar("OC_ENABLE_CMD_UPGRADE_RECOMMEND_PRECHECK", "true").EnvVar("OC_ENABLE_CMD_UPGRADE_RECOMMEND_ACCEPT", "true").Output()
o.Expect(err).NotTo(o.HaveOccurred())
hrm, e2e-aws-ovn-single-node-serial failed:
[Serial][sig-cli] oc adm upgrade recommend When the update service has conditional recommendations runs successfully with conditional recommendations to the --version target [Suite:openshift/conformance/serial] 25s
{ fail [github.com/openshift/origin/test/extended/cli/adm_upgrade/recommend.go:245]: Unexpected error:
...
Failing=True:
Reason: ClusterOperatorNotAvailable
Message: Cluster operator authentication is not available
...
error: issues that apply to this cluster but which were not included in --accept: Failing
The test-case that covers auth availability flaked:
[bz-apiserver-auth] clusteroperator/authentication should not change condition/Available
...15 unwelcome but acceptable clusteroperator state transitions during e2e test run...
...exception: https://issues.redhat.com/browse/OCPBUGS-20056...
Working in an exception might require some tricky regular-expression handling. Trying to gauge how frequent this issue is:
/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-single-node-serial 20
Of the aggregation run's 20 attempts:
- 9 passed.
- three failed the --version test-case on authentication Available=False.
- two failed on build-cluster registry 500s.
- two failed on prometheus-operator watch requests.
- one failed on a context deadline exceeded out of runUpdateService, with Failed to pull image...authentication required issues trying to get the tools image.
- one failed on authentication Pod restarts.
- one failed on an un-excepted authentication Available=False (OAuthServerDeployment_NoPod).
- one failed on an unexpected successful return in a TestImageStreamTagsAdmission test-case.
So the auth functionality is pretty flaky in these single-node clusters under serial-suite load. I've pushed c68be5e -> 78ac50f, to hopefully soften this test-case enough to not trip over this issue.
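The softening itself isn't quoted in this thread; as a rough illustration of the shape such a check could take, the helper below tolerates a recommend rejection whose only un-accepted issue is Failing. The name and the string parsing are hypothetical, not the PR's actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// acceptableFailure reports whether a `recommend` error mentions only the
// Failing condition, which the single-node auth flakiness above makes
// acceptable. Hypothetical helper, not the PR's actual implementation.
func acceptableFailure(stderr string) bool {
	const prefix = "error: issues that apply to this cluster but which were not included in --accept: "
	idx := strings.Index(stderr, prefix)
	if idx < 0 {
		return false
	}
	for _, issue := range strings.Split(stderr[idx+len(prefix):], ",") {
		if strings.TrimSpace(issue) != "Failing" {
			return false // some other un-accepted issue is present
		}
	}
	return true
}

func main() {
	fmt.Println(acceptableFailure("error: issues that apply to this cluster but which were not included in --accept: Failing"))                // prints true
	fmt.Println(acceptableFailure("error: issues that apply to this cluster but which were not included in --accept: Failing, SomethingElse")) // prints false
}
```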
/retest-required
Maybe this can't be in a threaded comment?
/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-single-node-serial 20
@wking: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e6b5d560-8a92-11f0-8a7f-9627853831e6-0
A depressing amount of the time in single-node testing, the authentication ClusterOperator is Available=False, and causes the 'recommend' command to error if the 'Failing' risk is not accepted. For example, in this aggregation run's 20 attempts [1]:

* 9 passed.
* three failed the '--version' test-case on 'authentication' 'Available=False' [2,3,4].
* two failed on build-cluster registry 500s [5,6].
* two failed on 'prometheus-operator' watch requests [7,8].
* one failed on a 'context deadline exceeded' out of 'runUpdateService' [9], with 'Failed to pull image...authentication required' issues trying to get the 'tools' image.
* one failed on authentication Pod restarts [10].
* one failed on an un-excepted 'authentication' 'Available=False' ('OAuthServerDeployment_NoPod') [11].
* one failed on an unexpected successful return in a TestImageStreamTagsAdmission test-case [12].

So the auth functionality is pretty flaky in these single-node clusters under serial-suite load. Ideally [13] can get addressed or the auth component can otherwise get firmed up, but until then, this commit softens our logic to allow that kind of ClusterOperator Available=False (which gets bubbled up as ClusterVersion Failing=True), without failing our new test-case.

[1]: openshift#30113 (comment)
[2]: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-origin-30113-nightly-4.21-e2e-aws-ovn-single-node-serial/1964055948225941504
[3]: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-origin-30113-nightly-4.21-e2e-aws-ovn-single-node-serial/1964055951426195456
[4]: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-origin-30113-nightly-4.21-e2e-aws-ovn-single-node-serial/1964055945973600256
[5]: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-origin-30113-nightly-4.21-e2e-aws-ovn-single-node-serial/1964055945063436288#1:build-log.txt%3A32
[6]: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-origin-30113-nightly-4.21-e2e-aws-ovn-single-node-serial/1964055945512226816#1:build-log.txt%3A43
[7]: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-origin-30113-nightly-4.21-e2e-aws-ovn-single-node-serial/1964055950948044800
[8]: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-origin-30113-nightly-4.21-e2e-aws-ovn-single-node-serial/1964055948683120640
[9]: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-origin-30113-nightly-4.21-e2e-aws-ovn-single-node-serial/1964055952327970816
[10]: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-origin-30113-nightly-4.21-e2e-aws-ovn-single-node-serial/1964055953233940480
[11]: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-origin-30113-nightly-4.21-e2e-aws-ovn-single-node-serial/1964055947772956672
[12]: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-origin-30113-nightly-4.21-e2e-aws-ovn-single-node-serial/1964055952797732864
[13]: https://issues.redhat.com/browse/OCPBUGS-20056
New changes are detected. LGTM label has been removed.
@wking: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Job Failure Risk Analysis for sha: 78ac50f
/payload-aggregate periodic-ci-openshift-release-master-nightly-4.21-e2e-aws-ovn-single-node-serial 20
@wking: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/6bc15d30-8b59-11f0-958c-09916b4ce3f0-0
These will probably not pass without more work, but checking to see how far away we are.