runtime, net: spurious wakeups in netpoll using kevent #14548
Comments
From my experience in #13853, I suppose that this is a darwin-specific issue. You can write a small dtrace script to see more details. I'm not sure whether this issue also occurs over external connectivity rather than loopback. I guess it's worth trying external connectivity.
I agree that this is likely darwin-specific. On linux, I believe spurious wakeups are possible, but they are relatively harmless because Spurious wakeups are also more common on
I don't know why you suppose so. I've just inspected the TCP control block inside the kernel using the following snippet and observed that the kernel woke the application even while the TCP state was "syn-sent." Furthermore, it happened without any notification through ev.flags such as EV_EOF or EV_ERROR.
FWIW, my dtrace script is the following:
@bdarnell, I'd be happy if you have some spare time to investigate this issue.
The reason is in the two The goroutine in
Um, I'm still not sure, because there are two types of kevent calls: one for the initial registration with the single per-process kqueue and another for capturing kernel events. I also don't know whether dtruss has better tracing functionality than dtrace. Anyway, when I tweak the dtrace script mentioned above like the following:
we can see
Perhaps there's a potential race condition in the kqueue implementation for Darwin, like #14127. Summary:
CL https://golang.org/cl/20468 mentions this issue.
What version of Go are you using (`go version`)?
go version go1.6 darwin/amd64
What operating system and processor architecture are you using (`go env`)?
GOARCH="amd64"
GOOS="darwin"
What did you do?
Open and close a lot of sockets to localhost in multiple goroutines, writing to the client side of each socket as soon as `net.Dial` returns. Runnable example here: https://github.com/tamird/go-conn-repro. This is the same repro case as #14539, but this issue is about the low-level networking problem discovered here rather than the error handling in `crypto/tls`.
What did you expect to see?
All connections should succeed.
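The scenario described under "What did you do?" can be sketched roughly as follows. This is not the linked repro, only a minimal illustration under stated assumptions: the function name `dialAndWrite`, the worker counts, and the plain-TCP draining server are all hypothetical choices made for the example.

```go
package main

import (
	"io"
	"net"
	"sync"
)

// dialAndWrite (hypothetical helper) starts a loopback listener, then has
// `workers` goroutines each open `perWorker` connections, write one byte as
// soon as net.Dial returns, and close the connection. It returns the number
// of attempts made and any errors observed on the client side.
func dialAndWrite(workers, perWorker int) (attempts int, errs []error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, []error{err}
	}
	defer ln.Close()

	// Server side: accept and drain each connection until the client closes it.
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return // listener closed
			}
			go func() {
				defer c.Close()
				io.Copy(io.Discard, c)
			}()
		}
	}()

	var (
		mu sync.Mutex
		wg sync.WaitGroup
	)
	// record counts one attempt and keeps its error, if any.
	record := func(err error) {
		mu.Lock()
		defer mu.Unlock()
		attempts++
		if err != nil {
			errs = append(errs, err)
		}
	}
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				c, err := net.Dial("tcp", ln.Addr().String())
				if err != nil {
					record(err)
					continue
				}
				_, err = c.Write([]byte{0}) // write immediately after Dial returns
				c.Close()
				record(err)
			}
		}()
	}
	wg.Wait()
	return attempts, errs
}
```

Calling `dialAndWrite(8, 1000)` and printing the returned errors is one way to look for the `ENOTCONN` failure discussed in this issue; whether it appears depends on the platform and on timing.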
What did you see instead?
Sometimes, connections hang and cause the test to time out. With the patch from https://go-review.googlesource.com/#/c/19990/ to return errors properly, we see that `net.Conn.Write` is returning "socket is not connected", i.e. `ENOTCONN`.
Analysis:
`runtime/netpoll.go` may have spurious wakeups, in which a goroutine blocked in `pollDesc.WaitWrite` may be released even though the socket is not writeable, and likewise for reads. This is normally fine: `WaitWrite` and `WaitRead` are used in loops, so if the goroutine is woken up too early it just gets `EAGAIN` from the system call and goes back to sleep. However, there is at least one case when an early wakeup results in an error other than `EAGAIN`: a socket that has not yet completed its asynchronous `connect(2)` call will return `ENOTCONN` for some system calls including `getpeername` and `write`. In addition, because the connection has not yet completed, the socket does not yet have an error to be returned by `getsockopt(SOL_SOCKET, SO_ERROR)`, so `netFD.connect` believes the connection has completed successfully. Once `connect` and `Dial` return, the connection is presumably writeable but the first call to `Write` may fail (this is exacerbated by the faulty error handling in `crypto/tls`, but is problematic in any case).
My evidence that this is in fact happening comes from running the above test case under `dtruss`. Compare a successful connection:
with an unsuccessful one:
(errno translations: 36=`EINPROGRESS`, 57=`ENOTCONN`). In the successful case, `getsockopt` is not called until `kevent` returns with a new event. In the failing case, there is no call to `kevent` with non-empty results in between the `connect` and `getsockopt` calls, so there must be a lingering wakeup caused by reuse of `pollDesc` objects that started before this snippet, which unblocks the goroutine and allows it to proceed to call `getsockopt`.
`netpoll.go` refers to the possibility of getting stale notifications, but says they are handled by the use of the `seq` field. This appears to be the case for deadlines, but `seq` does not appear to be consulted on the path between kqueue and waking up the goroutine.
I see two approaches to fixing this: either remove the possibility of spurious wakeups from `netpoll.go` so that a return from `WaitWrite` guarantees that the fd became writeable (e.g. check `seq` or `fd` when unblocking a waiter), or make `netFD.connect` tolerant of spurious wakeups by testing for the writeability of the socket before returning (e.g. attempt to `Write(nil)` and if it returns `ENOTCONN` go back into the `WaitWrite` loop).
All my analysis of this bug has been on OSX and I don't know which parts of this may vary on other platforms. This may shed some light on a TODO in `tcpsock_posix.go`.
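The second approach above, making the connect path tolerant of spurious wakeups, can be sketched as a retry loop. This is a minimal simulation under stated assumptions, not the runtime's or the `net` package's code: `sock`, `finishConnect`, `fakeSock`, and the `probe` method are hypothetical names, and `probe` stands in for whatever cheap call (for example `getpeername`) reports `ENOTCONN` while the connect is still pending.

```go
package main

import (
	"errors"
	"syscall"
)

// sock abstracts the three operations the connect path needs.
// All names here are hypothetical; they are not the runtime's API.
type sock interface {
	waitWrite() error // blocks until the poller reports writeability, possibly spuriously
	soError() error   // stands in for getsockopt(SOL_SOCKET, SO_ERROR); nil when unset
	probe() error     // cheap check that returns ENOTCONN while still connecting
}

// finishConnect returns only once the socket is really connected or has
// failed. A wakeup after which probe still reports ENOTCONN is treated as
// spurious, and the loop waits again. It also returns the number of wakeups.
func finishConnect(s sock) (int, error) {
	wakeups := 0
	for {
		if err := s.waitWrite(); err != nil {
			return wakeups, err
		}
		wakeups++
		if err := s.soError(); err != nil {
			return wakeups, err // the connect attempt itself failed
		}
		err := s.probe()
		if errors.Is(err, syscall.ENOTCONN) {
			continue // spurious wakeup: connect(2) has not completed yet
		}
		return wakeups, err
	}
}

// fakeSock simulates a socket whose first `spurious` wakeups arrive
// before the asynchronous connect has completed.
type fakeSock struct{ spurious int }

func (f *fakeSock) waitWrite() error { return nil }
func (f *fakeSock) soError() error   { return nil }
func (f *fakeSock) probe() error {
	if f.spurious > 0 {
		f.spurious--
		return syscall.ENOTCONN
	}
	return nil
}
```

With `&fakeSock{spurious: 2}`, the loop absorbs two early wakeups and returns success on the third, so callers never see `ENOTCONN` from a connection that `Dial` reported as established.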