OCPBUGS-14914: Set tunnel and server timeouts at backend level #536
Conversation
@alebedev87: This pull request references Jira Issue OCPBUGS-14914, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker.
[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver in each of these files; approvers can indicate their approval by writing /approve in a comment.
Force-pushed from 6a1044e to ef08ed2.
/jira refresh
@alebedev87: This pull request references Jira Issue OCPBUGS-14914, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug. Requesting review from QA contact. The bug has been updated to refer to the pull request using the external bug tracker.
@alebedev87: This pull request references Jira Issue OCPBUGS-14914, which is valid. 3 validation(s) were run on this bug. Requesting review from QA contact.
Tested it with 4.15.0-0.ci.test-2023-11-16-093712-ci-ln-dr45512-latest:

```console
% oc get route
% oc annotate route myedge haproxy.router.openshift.io/timeout-tunnel=70m
% oc -n openshift-ingress rsh router-default-d78ddc6f-bprk5
sh-4.4# wget --no-check-certificate --limit-rate=1000 --delete-after https://myedge-default.apps.ci-ln-9r4m3fk-72292.origin-ci-int-gce.dev.rhcloud.com/oc
oc.tmp              0%[                    ] 239.88K  1000 B/s    in 4m 6s
--2023-11-17 04:38:15--  (try:20)  https://myedge-default.apps.ci-ln-9r4m3fk-72292.origin-ci-int-gce.dev.rhcloud.com/oc
oc.tmp              3%[++++++              ]   4.62M  1000 B/s    in 3m 55s
2023-11-17 04:42:09 (1000 B/s) - Read error at byte 4847481/121359408 (The TLS connection was non-properly terminated.). Giving up.
sh-4.4#
04:41:53.345283 IP (tos 0x0, ttl 62, id 24165, offset 0, flags [DF], proto TCP (6), length 628)
```
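To double-check where the annotated timeout landed, the rendered HAProxy config inside the router pod can be inspected. A possible check — the config path shown is the usual router-pod location and is an assumption here:

```console
sh-4.4$ grep -A 10 'backend be_edge_http:default:myedge' /var/lib/haproxy/conf/haproxy.config
# expect a "timeout tunnel 70m" line on this route's backend rather than
# a single global value in the defaults section
```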
Back to WIP to address the config warning about missing timeouts.
Force-pushed from 4113831 to 6096a39.
/test e2e-agnostic
/test e2e-metal-ipi-ovn-ipv6
@alebedev87: This pull request references Jira Issue OCPBUGS-14914, which is valid. 3 validation(s) were run on this bug. Requesting review from QA contact. The bug has been updated to refer to the pull request using the external bug tracker.
I still see these errors in the most recent test results:
Did you mean to remove the WIP label?
Force-pushed from e165421 to 05a37e0.
Reviving the PR: starting to address Miciah's review.
/remove-lifecycle stale
Force-pushed from c1b2a01 to c67ef49.
High-level perf test of the baseline (CI image == 4.17) against the change (image with the fix):
```makefile
BENCH_PKGS ?= $(shell \grep -lR '^func Benchmark' | xargs -I {} dirname {} | sed 's|^|./|' | paste -sd ' ')
BENCH_TEST ?= .
```
Would you mind adding a couple of comments about what these are for? I see below that BENCH_PKGS is a value for flag -benchmem, so it could use a clarification.
Done.
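For readers unfamiliar with the Go tooling involved: `BENCH_PKGS` collects every package directory containing a top-level benchmark, i.e. any function of the following shape. This is a generic illustration, not code from this PR:

```go
package example_test

import (
	"strings"
	"testing"
)

// A declaration starting with "func Benchmark" is what the grep behind
// BENCH_PKGS matches; `go test -bench` runs the body b.N times, and
// -benchmem adds per-operation allocation stats to the report.
func BenchmarkJoin(b *testing.B) {
	parts := []string{"be_sni", "be_no_sni", "be_edge_http"}
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		_ = strings.Join(parts, ",")
	}
}
```

Assuming the `bench` target passes these variables to `go test`, `BENCH_TEST ?= .` would be a natural default for the `-bench` regexp (run every benchmark), while `-benchmem` itself is a boolean flag that takes no value.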
If I understand correctly, this represents a major change for our users: whatever the max connect timeout is in ANY route becomes the max server and tunnel timeout for all the routes on the backend. I strongly feel that this needs to be configurable, and off by default, so that existing users see the same behavior when they upgrade to a version that has this fix.
@candita: 1) Only the server timeout is set to the max value on the middle backend. 2) This is only one part of the change; the other part is the server and tunnel timeouts set on all the route backends. Note that the default timeout values are applied on top of the route-level timeouts, so we should not see any change in behavior: for routes which don't set any timeout value in the annotations, the default value will be set on the route backend.
"Configurable" this may put us in the same situation as right now - when a user can set a global (almost) timeout and override the timeouts set on the route level. "Off by default" would lead us to this warning during the config reloads. Overall, I understand the concern. The change may appear to have a potential to hide something I didn't manage to test manually and with our test suites. However I don't see any other solution for this bug. Let me try to organize a discussion about this PR outside of GitHub so that we can come to a conclusion. |
The middle backends (`be_sni`, `be_no_sni`) are updated with the server timeout which is set to the maximum value among all routes from the configuration. This prevents a warning message during config reloads. A benchmark test is added for the template helper function which finds the maximum timeout value among all routes (`maxTimeoutFirstMatchedAndClipped`). A new make target is introduced to run all benchmark tests (`bench`).
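As a rough illustration of the logic the commit message describes — this is a sketch, not the actual `maxTimeoutFirstMatchedAndClipped` implementation; the clip bound and the use of `time.ParseDuration` are assumptions:

```go
package main

import (
	"fmt"
	"time"
)

// clipBound is an assumed upper limit; the real helper clips to whatever
// maximum HAProxy accepts for a timeout, which is not reproduced here.
const clipBound = 100 * 24 * time.Hour

// maxRouteTimeout returns the largest parseable timeout among the given
// annotation values, clipped to clipBound. Values that do not parse are
// skipped.
func maxRouteTimeout(values []string) time.Duration {
	var max time.Duration
	for _, v := range values {
		d, err := time.ParseDuration(v)
		if err != nil {
			continue
		}
		if d > clipBound {
			d = clipBound
		}
		if d > max {
			max = d
		}
	}
	return max
}

func main() {
	// With routes annotated 30s, 5m, and 70m, the middle backends
	// (be_sni, be_no_sni) would get a 70m server timeout.
	fmt.Println(maxRouteTimeout([]string{"30s", "5m", "70m"})) // 1h10m0s
}
```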
Force-pushed from c67ef49 to d70581e.
Squashed the commits.
/test e2e-metal-ipi-ovn-ipv6
@alebedev87: This pull request references Jira Issue OCPBUGS-14914, which is valid. 3 validation(s) were run on this bug. Requesting review from QA contact. The bug has been updated to refer to the pull request using the external bug tracker.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close. /lifecycle stale
/remove-lifecycle stale
@alebedev87: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close. /lifecycle stale

Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so with /close. /lifecycle rotten
This PR removes the tunnel and server timeouts from the global `defaults` section and sets them on all the backends. The route annotations for the timeouts continue to work as they did before. As suggested in the upstream issue, the middle backends (`be_sni` and `be_no_sni`) are set with the maximum of all the route timeouts to avoid a warning message about missing timeouts during config reloads.

Test using the haproxy-timeout-tunnel repository:
The e2e test: openshift/cluster-ingress-operator#1013.
Presentation which summarizes the issue: link (Red Hat only).
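Schematically, and with made-up default values, the move from global to per-backend timeouts looks like this — a hand-written sketch, not the exact config the template renders:

```
# Before: one value in the global defaults section applies to every backend
defaults
  timeout server 30s
  timeout tunnel 1h

# After: each backend gets its own values from the route annotations
# (or the defaults when a route sets nothing)
backend be_edge_http:default:myedge
  timeout server 30s    # default; no timeout annotation on this route
  timeout tunnel 70m    # haproxy.router.openshift.io/timeout-tunnel=70m

backend be_sni
  timeout server 70m    # maximum across all routes, so reloads do not warn
```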