Skip to content
This repository was archived by the owner on Apr 30, 2025. It is now read-only.

fix(trace): remove GotConn delay #342

Merged

Conversation

maxmoehl
Copy link
Member

The delay causes a race condition in the go transport that results in a 502 Bad Gateway with:
endpoint_failure (readLoopPeekFailLocked: %!w(<nil>)).

This happens because the transport peeks the first few bytes on the connection and gets some data even though it doesn't expect any. This causes it to go into an error state even though there is no error resulting in the formatting directive to break.

This commit removes the delay and adds a note why we can't do this for now. This will reduce the amount of requests we can retry because the client will send data before we know that the connection is good. After we sent some data we can't be sure that the server hasn't started processing, hence no retry in such cases.

See: https://cloudfoundry.slack.com/archives/C033ALST37V/p1680888356483179
See: golang/go#31259
Resolves: cloudfoundry/routing-release#316

The delay causes a race condition in the go transport that results in a
502 Bad Gateway with:
  `endpoint_failure (readLoopPeekFailLocked: %!w(<nil>))`.

This happens because the transport peeks the first few bytes on the
connection and gets some data even though it doesn't expect any. This
causes it to go into an error state even though there is no error
resulting in the formatting directive to break.

This commit removes the delay and adds a note why we can't do this for
now. This will reduce the amount of requests we can retry because the
client will send data before we know that the connection is good. After
we sent _some_ data we can't be sure that the server hasn't started
processing, hence no retry in such cases.

See: https://cloudfoundry.slack.com/archives/C033ALST37V/p1680888356483179
See: golang/go#31259
Resolves: cloudfoundry/routing-release#316
@domdom82
Copy link
Contributor

technically, the race condition is already there. the delay only makes sure it is predictable by allowing one go routine to be scheduled before the competing one. so the delay does cause the race. Also bear in mind that even without the delay problems are not solved and the issue raised by @Gerg may still appear. it's just more random now.

@geofffranks geofffranks merged commit 4b605dd into cloudfoundry:main Apr 17, 2023
@maxmoehl maxmoehl deleted the remove-gorouter-dial-delay branch April 18, 2023 06:12
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Delay introduced in GotConn breaks CATS
3 participants