Skip to content

stress: failed test in cockroach/rpc/rpc.test: TestRemoteOffsetUnhealthy #8136

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
cockroach-teamcity opened this issue Jul 30, 2016 · 7 comments
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

Binary: cockroach/static-tests.tar.gz sha: https://github.com/cockroachdb/cockroach/commits/20408fd8bbf7a6e45c28eba10130be9bf9614310

Stress build found a failed test:

=== RUN   TestRemoteOffsetUnhealthy
I160730 04:40:36.077259 http2_server.go:268  transport: http2Server.HandleStreams failed to read frame: EOF
I160730 04:40:36.077316 http2_server.go:268  transport: http2Server.HandleStreams failed to read frame: EOF
I160730 04:40:36.077333 http2_server.go:268  transport: http2Server.HandleStreams failed to read frame: EOF
I160730 04:40:36.077399 http2_server.go:268  transport: http2Server.HandleStreams failed to read frame: EOF
I160730 04:40:36.077421 http2_server.go:268  transport: http2Server.HandleStreams failed to read frame: EOF
I160730 04:40:36.077429 http2_server.go:268  transport: http2Server.HandleStreams failed to read frame: EOF
I160730 04:40:36.077442 http2_server.go:268  transport: http2Server.HandleStreams failed to read frame: EOF
I160730 04:40:36.077476 http2_server.go:268  transport: http2Server.HandleStreams failed to read frame: EOF
I160730 04:40:36.077500 http2_server.go:268  transport: http2Server.HandleStreams failed to read frame: EOF
I160730 04:40:36.077517 http2_server.go:268  transport: http2Server.HandleStreams failed to read frame: EOF
I160730 04:40:36.077550 http2_server.go:268  transport: http2Server.HandleStreams failed to read frame: EOF
I160730 04:40:36.077565 http2_server.go:268  transport: http2Server.HandleStreams failed to read frame: EOF
--- FAIL: TestRemoteOffsetUnhealthy (5.40s)
    context_test.go:387: max offset: 100ms - node 0 with acceptable clock offset of 0 did not return an error, as expected
    context_test.go:387: max offset: 100ms - node 1 with acceptable clock offset of 0 did not return an error, as expected
    context_test.go:387: max offset: 100ms - node 2 with acceptable clock offset of 0 did not return an error, as expected
    context_test.go:379: max offset: 100ms - node 3 with excessive clock offset of 100.000001ms returned expected error: fewer than half the known nodes are within the maximum offset of 100ms (0 of 3)
    leaktest.go:92: Leaked goroutine: goroutine 281 [chan receive]:
        google.golang.org/grpc/credentials.(*tlsCreds).ClientHandshake(0xc8200843a0, 0xc8202b5120, 0xf, 0x7fd2a1c71c68, 0xc8200ba040, 0x4a813f7fa, 0x0, 0x0, 0x0, 0x0, ...)
            /go/src/google.golang.org/grpc/credentials/credentials.go:170 +0x3b5
        google.golang.org/grpc/transport.newHTTP2Client(0xc8202b5120, 0xf, 0x0, 0x0, 0xc8202fe240, 0x0, 0x0, 0x0, 0x0, 0x7fd2a1c706b0, ...)
            /go/src/google.golang.org/grpc/transport/http2_client.go:134 +0x3ac
        google.golang.org/grpc/transport.NewClientTransport(0xc8202b5120, 0xf, 0x0, 0x0, 0xc8202fe240, 0x0, 0x0, 0x0, 0x0, 0x7fd2a1c706b0, ...)
            /go/src/google.golang.org/grpc/transport/transport.go:373 +0x6e
        google.golang.org/grpc.(*addrConn).resetTransport(0xc82075e000, 0x11e8200, 0x0, 0x0)
            /go/src/google.golang.org/grpc/clientconn.go:568 +0x42f
        google.golang.org/grpc.(*ClientConn).newAddrConn.func1(0xc82075e000)
            /go/src/google.golang.org/grpc/clientconn.go:415 +0x2a
        created by google.golang.org/grpc.(*ClientConn).newAddrConn
            /go/src/google.golang.org/grpc/clientconn.go:421 +0x51c
    leaktest.go:92: Leaked goroutine: goroutine 291 [IO wait]:
        net.runtime_pollWait(0x7fd2a1c70890, 0x72, 0xc8206b0000)
            /usr/local/go/src/runtime/netpoll.go:160 +0x60
        net.(*pollDesc).Wait(0xc8202c2220, 0x72, 0x0, 0x0)
            /usr/local/go/src/net/fd_poll_runtime.go:73 +0x3a
        net.(*pollDesc).WaitRead(0xc8202c2220, 0x0, 0x0)
            /usr/local/go/src/net/fd_poll_runtime.go:78 +0x36
        net.(*netFD).Read(0xc8202c21c0, 0xc8206b0000, 0x400, 0x400, 0x0, 0x7fd2a1c65028, 0xc82007a080)
            /usr/local/go/src/net/fd_unix.go:250 +0x23a
        net.(*conn).Read(0xc8200ba040, 0xc8206b0000, 0x400, 0x400, 0x0, 0x0, 0x0)
            /usr/local/go/src/net/net.go:172 +0xe4
        crypto/tls.(*block).readFromUntil(0xc820304000, 0x7fd2a1c71cc8, 0xc8200ba040, 0x5, 0x0, 0x0)
            /usr/local/go/src/crypto/tls/conn.go:460 +0xcc
        crypto/tls.(*Conn).readRecord(0xc820281b00, 0x14, 0x0, 0x0)
            /usr/local/go/src/crypto/tls/conn.go:562 +0x2d1
        crypto/tls.(*clientHandshakeState).readFinished(0xc820735e60, 0x0, 0x0, 0x0, 0x0, 0x0)
            /usr/local/go/src/crypto/tls/handshake_client.go:566 +0x55
        crypto/tls.(*Conn).clientHandshake(0xc820281b00, 0x0, 0x0)
            /usr/local/go/src/crypto/tls/handshake_client.go:225 +0x1923
        crypto/tls.(*Conn).Handshake(0xc820281b00, 0x0, 0x0)
            /usr/local/go/src/crypto/tls/conn.go:1033 +0xec
        google.golang.org/grpc/credentials.(*tlsCreds).ClientHandshake.func2(0xc8200760c0, 0xc820281b00)
            /go/src/google.golang.org/grpc/credentials/credentials.go:168 +0x21
        created by google.golang.org/grpc/credentials.(*tlsCreds).ClientHandshake
            /go/src/google.golang.org/grpc/credentials/credentials.go:169 +0x385
    leaktest.go:92: Leaked goroutine: goroutine 340 [IO wait]:
        net.runtime_pollWait(0x7fd2a1c70d10, 0x72, 0xc8206b1800)
            /usr/local/go/src/runtime/netpoll.go:160 +0x60
        net.(*pollDesc).Wait(0xc8200a05a0, 0x72, 0x0, 0x0)
            /usr/local/go/src/net/fd_poll_runtime.go:73 +0x3a
        net.(*pollDesc).WaitRead(0xc8200a05a0, 0x0, 0x0)
            /usr/local/go/src/net/fd_poll_runtime.go:78 +0x36
        net.(*netFD).Read(0xc8200a0540, 0xc8206b1800, 0x400, 0x400, 0x0, 0x7fd2a1c65028, 0xc82007a080)
            /usr/local/go/src/net/fd_unix.go:250 +0x23a
        net.(*conn).Read(0xc82015c0f0, 0xc8206b1800, 0x400, 0x400, 0x0, 0x0, 0x0)
            /usr/local/go/src/net/net.go:172 +0xe4
        crypto/tls.(*block).readFromUntil(0xc820140990, 0x7fd2a1c71cc8, 0xc82015c0f0, 0x5, 0x0, 0x0)
            /usr/local/go/src/crypto/tls/conn.go:460 +0xcc
        crypto/tls.(*Conn).readRecord(0xc82054b800, 0x16, 0x0, 0x0)
            /usr/local/go/src/crypto/tls/conn.go:562 +0x2d1
        crypto/tls.(*Conn).readHandshake(0xc82054b800, 0x0, 0x0, 0x0, 0x0)
            /usr/local/go/src/crypto/tls/conn.go:800 +0xd3
        crypto/tls.(*serverHandshakeState).doFullHandshake(0xc82080fb18, 0x0, 0x0)
            /usr/local/go/src/crypto/tls/handshake_server.go:413 +0xae7
        crypto/tls.(*Conn).serverHandshake(0xc82054b800, 0x0, 0x0)
            /usr/local/go/src/crypto/tls/handshake_server.go:80 +0x31c
        crypto/tls.(*Conn).Handshake(0xc82054b800, 0x0, 0x0)
            /usr/local/go/src/crypto/tls/conn.go:1035 +0x169
        google.golang.org/grpc/credentials.(*tlsCreds).ServerHandshake(0xc820034070, 0x7fd2a1c71c68, 0xc82015c0f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
            /go/src/google.golang.org/grpc/credentials/credentials.go:183 +0x11d
        google.golang.org/grpc.(*Server).useTransportAuthenticator(0xc82058adc0, 0x7fd2a1c71c68, 0xc82015c0f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
            /go/src/google.golang.org/grpc/server.go:309 +0xbd
        google.golang.org/grpc.(*Server).handleRawConn(0xc82058adc0, 0x7fd2a1c71c68, 0xc82015c0f0)
            /go/src/google.golang.org/grpc/server.go:352 +0x4a
        created by google.golang.org/grpc.(*Server).Serve
            /go/src/google.golang.org/grpc/server.go:345 +0x38c

Run Details:

95 runs so far, 0 failures, over 5s
130 runs completed, 1 failures, over 7s
FAIL

Please assign, take a look and update the issue accordingly.

@cockroach-teamcity cockroach-teamcity added this to the Q3 milestone Jul 30, 2016
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Jul 30, 2016
@tamird
Copy link
Contributor

tamird commented Jul 30, 2016

#7793.

@tamird
Copy link
Contributor

tamird commented Jul 30, 2016

Upon closer inspection, there are 3 leaked goroutines here:

  1. crypto/tls.(*Conn).serverHandshake [IO wait]
  2. crypto/tls.(*Conn).clientHandshake [IO wait]
  3. google.golang.org/grpc/credentials.(*tlsCreds).ClientHandshake [chan receive]

The 3rd goroutine is grpc on the client side waiting for authentication to finish. My upstream PR grpc/grpc-go#796 which introduces better cancellation makes this guy go away.

The first two goroutines are what's really troubling here. This looks like a deadlocked TLS handshake, which implies a bug in the standard library. I believe we've seen aberrations of this bug here and there, but I'm now able to reproduce it locally using this test in under 1000 runs.

cc @bdarnell @bradfitz @petermattis @tschottdorf

EDIT: in case members of the Go team decide to help out here, I've been running this test using make stress PKG=./rpc TESTS=TestRemoteOffsetUnhealthy.

EDIT 2: after #8145 is merged, the invocation changes to make stress COCKROACH_IGNORESTDLIBLEAKS=0 PKG=./rpc TESTS=TestRemoteOffsetUnhealthy.

@tamird tamird self-assigned this Jul 30, 2016
@petermattis
Copy link
Collaborator

@tamird Can we ignore those goroutines for leak checking? At least temporarily.

@tamird
Copy link
Contributor

tamird commented Jul 30, 2016

Done in #8145.

@tamird tamird closed this as completed Jul 30, 2016
@tamird tamird reopened this Jul 30, 2016
@tamird
Copy link
Contributor

tamird commented Jul 30, 2016

This doesn't seem to reproduce in go1.7rc3.

@tamird
Copy link
Contributor

tamird commented Jul 30, 2016

Perhaps it has something to do with golang/go#14539 which was "fixed" in the 1.7 cycle.

@tamird
Copy link
Contributor

tamird commented Jul 30, 2016

Also possibly related is golang/go#14548, though that was darwin-specific.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Projects
None yet
Development

No branches or pull requests

3 participants