- What version of Go are you using (
go version)?
go version go1.6 darwin/amd64
-
What operating system and processor architecture are you using (go env)?
GOARCH="amd64"
GOOS="darwin"
-
What did you do?
Open and close a lot of sockets to localhost in multiple goroutines, writing to the client side of each socket as soon as net.Dial returns.
Runnable example here: https://github.com/tamird/go-conn-repro. This is the same repro case as #14539, but this issue is about the low-level networking problem discovered here rather than the error handling in crypto/tls.
- What did you expect to see?
All connections should succeed.
- What did you see instead?
Sometimes, connections hang and cause the test to time out. With the patch from https://go-review.googlesource.com/#/c/19990/ to return errors properly, we see that net.Conn.Write is returning "socket is not connected", i.e.ENOTCONN.
Analysis:
runtime/netpoll.go may have spurious wakeups, in which a goroutine blocked in pollDesc.WaitWrite may be released even though the socket is not writeable, and likewise for reads. This is normally fine: WaitWrite and WaitRead are used in loops so if the goroutine is woken up too early it just gets EAGAIN from the system call and goes back to sleep. However, there is at least one case when an early wakeup results in an error other than EAGAIN: a socket that has not yet completed its asynchronous connect(2) call will return ENOTCONN for some system calls including getpeername and write. In addition, because the connection has not yet completed, the socket does not yet have an error to be returned by getsockopt(SOL_SOCKET, SO_ERROR), so netFD.connect believes the connection has completed successfully. Once connect and Dial return, the connection is presumably writeable but the first call to Write may fail (this is exacerbated by the faulty error handling in crypto/tls, but is problematic in any case).
My evidence that this in fact happening comes from running the above test case under dtruss. Compare a successful connection:
connect(0x9, 0xC8202BB42C, 0x10) = -1 Err#36
kevent(0x6, 0xC82004ACA8, 0x2) = 0 0
kevent(0x6, 0x0, 0x0) = 1 0
getsockopt(0x9, 0xFFFF, 0x1007) = 0 0
getsockname(0x9, 0xC82004AEE4, 0xC82004AEE0) = 0 0
getpeername(0x9, 0xC82004AEE4, 0xC82004AEE0) = 0 0
setsockopt(0x9, 0x6, 0x1) = 0 0
write(0x9, "\026\003\001\0", 0x84) = 132 0
with an unsuccessful one:
connect(0x7, 0xC82038078C, 0x10) = -1 Err#36
kevent(0x6, 0xC820176CA8, 0x2) = 0 0
write(0x9, "\025\003\003\0", 0x1F) = 31 0
close(0x9) = 0 0
kevent(0x6, 0x0, 0x0) = 0 0
getsockopt(0x7, 0xFFFF, 0x1007) = 0 0
getsockname(0x7, 0xC820176EE4, 0xC820176EE0) = 0 0
getpeername(0x7, 0xC820176EE4, 0xC820176EE0) = -1 Err#57
setsockopt(0x7, 0x6, 0x1) = 0 0
write(0x7, "\0", 0x84) = -1 Err#57
(errno translations: 36=EINPROGRESS, 57=ENOTCONN). In the successful case, getsockopt is not called until kevent returns with a new event. In the failing case, there is no call to kevent with non-empty results in between the connect and getsockopt calls, so there must be a lingering wakeup caused by reuse of pollDesc objects that started before this snippet, which unblocks the goroutine and allows it to proceed to call getsockopt.
[netpoll.go] refers to the possibility of getting stale notifications, but says they are handled by the use of the seq field. This appears to be the case for deadlines, but seq does not appear to be consulted on the path between kqueue and waking up the goroutine.
I see two approaches to fixing this: either remove the possibility of spurious wakeups from netpoll.go so that a return from WaitWrite guarantees that the fd became writeable (e.g. check seq or fd when unblocking a waiter), or make netFD.connect tolerant of spurious wakeups by testing for the writeability of the socket before returning (e.g. attempt to Write(nil) and if it returns ENOTCONN go back into the WaitWrite loop).
All my analysis of this bug has been on OSX and I don't know which parts of this may vary on other platforms. This may shed some light on a TODO in tcpsock_posix.go.
go version)?go version go1.6 darwin/amd64What operating system and processor architecture are you using (
go env)?GOARCH="amd64"
GOOS="darwin"
What did you do?
Open and close a lot of sockets to localhost in multiple goroutines, writing to the client side of each socket as soon as
net.Dialreturns.Runnable example here: https://github.com/tamird/go-conn-repro. This is the same repro case as #14539, but this issue is about the low-level networking problem discovered here rather than the error handling in
crypto/tls.All connections should succeed.
Sometimes, connections hang and cause the test to time out. With the patch from https://go-review.googlesource.com/#/c/19990/ to return errors properly, we see that
net.Conn.Writeis returning "socket is not connected", i.e.ENOTCONN.Analysis:
runtime/netpoll.gomay have spurious wakeups, in which a goroutine blocked inpollDesc.WaitWritemay be released even though the socket is not writeable, and likewise for reads. This is normally fine:WaitWriteandWaitReadare used in loops so if the goroutine is woken up too early it just getsEAGAINfrom the system call and goes back to sleep. However, there is at least one case when an early wakeup results in an error other thanEAGAIN: a socket that has not yet completed its asynchronousconnect(2)call will returnENOTCONNfor some system calls includinggetpeernameandwrite. In addition, because the connection has not yet completed, the socket does not yet have an error to be returned bygetsockopt(SOL_SOCKET, SO_ERROR), sonetFD.connectbelieves the connection has completed successfully. OnceconnectandDialreturn, the connection is presumably writeable but the first call toWritemay fail (this is exacerbated by the faulty error handling incrypto/tls, but is problematic in any case).My evidence that this in fact happening comes from running the above test case under
dtruss. Compare a successful connection:with an unsuccessful one:
(errno translations: 36=
EINPROGRESS, 57=ENOTCONN). In the successful case,getsockoptis not called untilkeventreturns with a new event. In the failing case, there is no call tokeventwith non-empty results in between theconnectandgetsockoptcalls, so there must be a lingering wakeup caused by reuse ofpollDescobjects that started before this snippet, which unblocks the goroutine and allows it to proceed to callgetsockopt.[
netpoll.go] refers to the possibility of getting stale notifications, but says they are handled by the use of theseqfield. This appears to be the case for deadlines, butseqdoes not appear to be consulted on the path between kqueue and waking up the goroutine.I see two approaches to fixing this: either remove the possibility of spurious wakeups from
netpoll.goso that a return fromWaitWriteguarantees that the fd became writeable (e.g. checkseqorfdwhen unblocking a waiter), or makenetFD.connecttolerant of spurious wakeups by testing for the writeability of the socket before returning (e.g. attempt toWrite(nil)and if it returnsENOTCONNgo back into theWaitWriteloop).All my analysis of this bug has been on OSX and I don't know which parts of this may vary on other platforms. This may shed some light on a TODO in
tcpsock_posix.go.