net: localhost pipes cause test hangs on macOS Sierra #25696
Comments
CC: @FiloSottile @agl |
I've looked into this a bit. My best guess is that the OS drops localhost pipes on the floor when it is heavily loaded. You can reproduce the problem by running The tests that fail due to this issue seem to be:
They all have problems with localhost pipes mysteriously failing. There are a few other unrelated failures having to do with
I'll keep digging. |
Hi! I ran into a similar problem here: #26317 (comment)
It hangs even with
The most interesting thing that
|
Sorry, even the -short sometimes hangs. Please try running this:
I killed the hanging tests manually with So it has nothing to do with short/non-short difference. |
Retitled since this is not specific to the crypto/tls tests, and un-assigned because I'm not the domain expert and unfortunately won't have the bandwidth to understand this early enough for 1.11. (Sorry this came so late, it was not on my radar.) |
@mwf I think the behavior you're seeing is unrelated to this issue. I'm going to open a separate issue for it. |
Can we disable the test on this builder until a fix is found? |
There are a bunch of tests that contribute to the problem; each one run individually doesn't fail. So there's no "the test" to be disabled. I guess we could disable several of the pipe-heavy tests and see if that helps. Or perhaps run some of them in "extrashort" mode on 10.12 where they would use fewer pipes. |
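For anyone who wants to see how load interacts with localhost connections, here is a rough sketch. It is not the reproduction command from earlier in the thread, and none of these names come from the Go test suite: `roundTrips`, `roundTrip`, `serveEcho`, and `stressTimeout` are made up for illustration. It opens `n` concurrent loopback TCP connections against an echo server and puts a deadline on each one, so a dropped connection shows up as a counted failure instead of a hang. A smaller `n` is roughly what an "extrashort" mode would amount to.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"sync"
	"time"
)

// stressTimeout is an arbitrary per-connection limit chosen for this sketch.
const stressTimeout = 5 * time.Second

// serveEcho accepts connections until ln is closed and echoes back
// whatever each client sends.
func serveEcho(ln net.Listener) {
	for {
		c, err := ln.Accept()
		if err != nil {
			return // listener closed
		}
		go func() {
			defer c.Close()
			io.Copy(c, c)
		}()
	}
}

// roundTrip dials addr, sends a short message, and waits for the echo.
// The deadline turns a stalled connection into a timeout error.
func roundTrip(addr string, timeout time.Duration) error {
	c, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return err
	}
	defer c.Close()
	if err := c.SetDeadline(time.Now().Add(timeout)); err != nil {
		return err
	}
	msg := []byte("ping")
	if _, err := c.Write(msg); err != nil {
		return err
	}
	buf := make([]byte, len(msg))
	if _, err := io.ReadFull(c, buf); err != nil {
		return err
	}
	if string(buf) != string(msg) {
		return fmt.Errorf("echo mismatch: got %q", buf)
	}
	return nil
}

// roundTrips runs n concurrent round trips over 127.0.0.1 and returns
// how many of them failed.
func roundTrips(n int, timeout time.Duration) (int, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer ln.Close()
	go serveEcho(ln)

	var (
		wg     sync.WaitGroup
		mu     sync.Mutex
		failed int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if err := roundTrip(ln.Addr().String(), timeout); err != nil {
				mu.Lock()
				failed++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return failed, nil
}

func main() {
	failed, err := roundTrips(64, stressTimeout)
	if err != nil {
		fmt.Println("setup error:", err)
		return
	}
	fmt.Printf("%d of 64 loopback round trips failed\n", failed)
}
```

The connection count of 64 is arbitrary. On a healthy machine I would expect zero failures at that level; whether a larger count triggers the behavior described in this issue is untested.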
Are there any updates on this? It continues to fail every build for Darwin 10_12. Perhaps we can assign someone to this issue so we know it is being worked on? |
If all these failing tests are listening on |
This seems related: I can't ping localhost on our 10.12 builders:
And I confirmed that's even on a fresh boot (we boot a new VM per build or gomote session). So it's not like we somehow manage to break localhost. Maybe we have some weird firewall enabled on Sierra? |
Doesn't appear to be the Mac firewall...
|
No difference in routing table between 10.11 (working) and 10.12 buildlets:
And nothing jumps out as a notable difference on the interfaces between 10.11 and 10.12:
|
FWIW, this isn't a name resolution bug. It also doesn't work if I ping 127.0.0.1 directly:
|
It's not just localhost. I can create a macOS 10.11 and a 10.12 VM, then gomote ssh into 10.11, and from 10.11 ssh into 10.12. But once I'm in 10.12, I can't telnet to the exact same IP & port I'd used a second ago from another machine:
So definitely smelling like some firewall thing. |
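A TCP-level version of the same check can be written in Go, which helps on machines where ping or telnet are not handy. This is only a sketch: `probeLoopback` and `probeTimeout` are names invented here, and it exercises TCP over 127.0.0.1 rather than ICMP, so it is not an exact stand-in for the ping test above. It listens on an ephemeral loopback port, dials it, and pushes one byte through, with timeouts at every step.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
	"time"
)

// probeTimeout is an arbitrary limit chosen for this sketch.
const probeTimeout = 2 * time.Second

// probeLoopback reports an error if a TCP connection to 127.0.0.1
// cannot be set up and carry a single byte within timeout.
func probeLoopback(timeout time.Duration) error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return fmt.Errorf("listen: %w", err)
	}
	defer ln.Close()

	// Server side: accept one connection and read one byte from it.
	done := make(chan error, 1)
	go func() {
		c, err := ln.Accept()
		if err != nil {
			done <- err
			return
		}
		defer c.Close()
		c.SetDeadline(time.Now().Add(timeout))
		_, err = io.ReadFull(c, make([]byte, 1))
		done <- err
	}()

	// Client side: dial the listener and send one byte.
	c, err := net.DialTimeout("tcp", ln.Addr().String(), timeout)
	if err != nil {
		return fmt.Errorf("dial: %w", err)
	}
	defer c.Close()
	c.SetDeadline(time.Now().Add(timeout))
	if _, err := c.Write([]byte{'x'}); err != nil {
		return fmt.Errorf("write: %w", err)
	}

	select {
	case err := <-done:
		if err != nil {
			return fmt.Errorf("server side: %w", err)
		}
		return nil
	case <-time.After(timeout):
		return errors.New("timed out waiting for the server side")
	}
}

func main() {
	if err := probeLoopback(probeTimeout); err != nil {
		fmt.Println("loopback probe failed:", err)
		return
	}
	fmt.Println("loopback probe ok")
}
```

If this fails on a freshly booted VM, that would point at the network setup rather than at anything the tests do.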
@bradfitz user-bradfitz-darwin-amd64-10_12-0 has DAD (duplicate address detection) whereas user-bradfitz-darwin-amd64-10_11-0 does not... Potentially VM related? What is the VM running in, xhyve? |
@dotwaffle, VMware. |
I'm also noticing I can't ssh into our Linux management/controller VM in that LAN. Port 22 connects but I don't see the SSH banner, suggesting DNS timeouts. And related: we've been getting test failures about DNS in #27992 |
Are you sure it connects? It sounds like VMware may be either transparently proxying, or doing NAT for you. With |
What is the |
When I was investigating this, it caused all sorts of problems with other processes on my desktop. For instance, when it ran out of ports Chrome would not be able to connect to new websites. Connections already open in existing tabs would work fine. Sometimes the machine would recover after a few minutes, sometimes I had to reboot to recover it. |
Change https://golang.org/cl/142817 mentions this issue: |
crypto/tls is meant to work over network connections with buffering, not synchronous connections, as explained in #24198. Tests based on net.Pipe are unrealistic as reads and writes are matched one to one. Such tests worked just thanks to the implementation details of the tls.Conn internal buffering, and would break if for example the flush of the first flight of the server was not entirely assimilated by the client rawInput buffer before the client attempted to reply to the ServerHello.

Note that this might run into the Darwin network issues at #25696.

Fixed a few test races that were either hidden or synchronized by the use of the in-memory net.Pipe. Also, this gets us slightly more realistic benchmarks, reflecting some syscall cost of Read and Write operations.

Change-Id: I5a597b3d7a81b8ccc776030cc837133412bf50f8
Reviewed-on: https://go-review.googlesource.com/c/142817
Run-TryBot: Filippo Valsorda <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Brad Fitzpatrick <[email protected]>
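To make the "synchronous versus buffered" point in that commit message concrete, here is a small sketch. It is not code from the CL; `writeWithoutReader`, `pipeWriteCompletes`, `tcpWriteCompletes`, and `shortTimeout` are illustrative names. It writes a few bytes to each kind of connection while nobody reads on the other end. With net.Pipe the write has to wait for a matching Read, so it hits the deadline. With a loopback TCP connection the kernel socket buffer should absorb a small write and it returns right away.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// shortTimeout is an arbitrary limit chosen for this sketch.
const shortTimeout = 200 * time.Millisecond

// writeWithoutReader writes msg to c while the peer is not reading and
// reports whether the write finished before the deadline expired.
func writeWithoutReader(c net.Conn, msg []byte, timeout time.Duration) (bool, error) {
	if err := c.SetWriteDeadline(time.Now().Add(timeout)); err != nil {
		return false, err
	}
	_, err := c.Write(msg)
	if err == nil {
		return true, nil
	}
	var ne net.Error
	if errors.As(err, &ne) && ne.Timeout() {
		return false, nil // blocked until the deadline
	}
	return false, err
}

// pipeWriteCompletes tries an unread write on an in-memory net.Pipe.
func pipeWriteCompletes(timeout time.Duration) (bool, error) {
	a, b := net.Pipe()
	defer a.Close()
	defer b.Close()
	return writeWithoutReader(a, []byte("hello"), timeout)
}

// tcpWriteCompletes tries an unread write on a loopback TCP connection.
// The server side is accepted but never reads.
func tcpWriteCompletes(timeout time.Duration) (bool, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, err
	}
	defer ln.Close()

	client, err := net.DialTimeout("tcp", ln.Addr().String(), timeout)
	if err != nil {
		return false, err
	}
	defer client.Close()

	server, err := ln.Accept()
	if err != nil {
		return false, err
	}
	defer server.Close()

	return writeWithoutReader(client, []byte("hello"), timeout)
}

func main() {
	ok, err := pipeWriteCompletes(shortTimeout)
	fmt.Println("net.Pipe write completed without a reader:", ok, err)

	ok, err = tcpWriteCompletes(shortTimeout)
	fmt.Println("TCP write completed without a reader:", ok, err)
}
```

This is why a test that happens to pass over net.Pipe can depend on the exact order of reads and writes, while the same test over TCP does not.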
Pinging the thread again: it looks like the issue (though perhaps resolved for some time) has come back. The latest darwin-arm64-10_12 builds are failing with the same error. |
I don't think anyone submitted a fix for this issue. |
I don't think the error mentioned in #25696 (comment) is the same problem. That is a different timeout running the cmd/go tests. I don't think this error has happened since October 15 (https://build.golang.org/log/9555fcf3f2c2eb12414d36d68a60a010a2680ff0). I don't see any obvious fix in our code, but perhaps the kernel was fixed. Optimistically closing. |
Examples:
https://build.golang.org/log/03f178205193e9987dde8fb7353e357df0cc21af
https://build.golang.org/log/d9074cedf2cea9fd9548dc896aa16a3fcf18ca74
Failure mode: