
runtime: random fatal errors in runtime #10941


Closed
methane opened this issue May 23, 2015 · 32 comments

@methane
Contributor

methane commented May 23, 2015

Version: tip 8017ace
Program: https://github.com/methane/hellobench.go/blob/master/hello.go
Stack dump: https://gist.github.com/methane/9701f5c7f58e2d701b65
To reproduce:

I've encountered this problem twice, but I forgot to save a core file.

Is this a known bug?

@minux changed the title from "Segmentation fault in malloc" to "runtime: segmentation fault in malloc" on May 24, 2015
@minux added this to the Go1.5 milestone on May 24, 2015
@methane
Contributor Author

methane commented May 24, 2015

I can reproduce it with simpler code: https://gist.github.com/methane/7f0e381b6bde87dd8469#file-hello-go

$ ~/local/go-dev/bin/go version
go version devel +8017ace Sat May 23 17:42:43 2015 +0000 linux/amd64
$ ~/local/go-dev/bin/go build hello.go
$ GOMAXPROCS=16 ./hello
$ wrk -t4 -c800 http://127.0.0.1:8080/ -d10000

OS is Linux amd64
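
For reference, the reproducer is essentially a plain net/http "hello" server; a minimal sketch along these lines should be equivalent (the gist above is authoritative; the handler body and the hard-coded :8080 port are assumptions taken from the wrk command):

package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Tiny response body so the benchmark is dominated by scheduler/GC
	// work rather than request processing.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "Hello, World!\n")
	})
	// The wrk command above targets http://127.0.0.1:8080/.
	http.ListenAndServe(":8080", nil)
}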

@methane
Contributor Author

methane commented May 24, 2015

But this time I got a different crash message:

runtime: g678: leftover defer sp=0x0 pc=0xc208948800
    defer 0xc208c77140 sp=0x0 pc=0xc208948800
fatal error: traceback has leftover defers

https://gist.github.com/methane/7f0e381b6bde87dd8469#file-crashdump

@davecheney
Contributor

Thanks for your report. I have not been able to reproduce either of these crashes yet.

% go version
go version devel +001438b Thu May 21 19:14:03 2015 +0000 linux/amd64

Are you able to reproduce the crash outside the EC2 environment?

@methane
Contributor Author

methane commented May 24, 2015

I think more concurrency makes it easier to reproduce.
I haven't reproduced it with GOMAXPROCS=4.
But I don't have a many-core machine; I always use EC2 when I need the extra horsepower.

@minux
Member

minux commented May 24, 2015 via email

@dspezia
Contributor

dspezia commented May 24, 2015

I cannot reproduce on small VMs. Will try later on larger ones.
How long do you have to let it run to produce a crash?

$ go version
go version devel +eeb64b7 Sat May 23 05:22:57 2015 +0000 linux/amd64

@dspezia
Contributor

dspezia commented May 24, 2015

I reproduced it on an EC2 instance with 36 vCPUs (and GOMAXPROCS=36), after 10 min of traffic injection at more than 400K r/s (about 200M requests).

runtime: g1645: leftover defer sp=0xdeaddeaddeaddead pc=0x4dfafe
        defer 0xc208257bc0 sp=0xdeaddeaddeaddead pc=0x4dfafe
        defer 0xc2080eeaa0 sp=0xc20a43bc80 pc=0x469ea6
fatal error: traceback has leftover defers

runtime stack:
runtime.throw(0x73c8d0, 0x1d)
        /home/ec2-user/go/src/runtime/panic.go:527 +0x96
runtime.gentraceback(0x428fbc, 0xc20a43b4b0, 0x0, 0xc20a53a280, 0x0, 0x0, 0x7fffffff, 0x7fae5cff8cb8, 0x0, 0x0, ...)
        /home/ec2-user/go/src/runtime/traceback.go:450 +0x7b9
runtime.scanstack(0xc20a53a280)
        /home/ec2-user/go/src/runtime/mgcmark.go:319 +0x28b
runtime.gcphasework(0xc20a53a280)
        /home/ec2-user/go/src/runtime/mgcmark.go:270 +0x5a
runtime.stopg(0xc20a53a280, 0x4)
        /home/ec2-user/go/src/runtime/proc1.go:395 +0x91
runtime.markroot(0xc20801a000, 0x4b6)
        /home/ec2-user/go/src/runtime/mgcmark.go:138 +0x1f2
runtime.parfordo(0xc20801a000)
        /home/ec2-user/go/src/runtime/parfor.go:95 +0x12d
runtime.gcscan_m()
        /home/ec2-user/go/src/runtime/mgcmark.go:37 +0xe1
runtime.gc.func1()
        /home/ec2-user/go/src/runtime/mgc.go:829 +0x5f
runtime.systemstack(0xc20802c000)
        /home/ec2-user/go/src/runtime/asm_amd64.s:262 +0x7c
runtime.mstart()
        /home/ec2-user/go/src/runtime/proc1.go:656

goroutine 194 [garbage collection scan]:
runtime.systemstack_switch()
        /home/ec2-user/go/src/runtime/asm_amd64.s:216 fp=0xc2088a6d90 sp=0xc2088a6d88
runtime.gc(0x0)
        /home/ec2-user/go/src/runtime/mgc.go:841 +0x1dc fp=0xc2088a6fb0 sp=0xc2088a6d90
runtime.backgroundgc()
        /home/ec2-user/go/src/runtime/mgc.go:761 +0x40 fp=0xc2088a6fe0 sp=0xc2088a6fb0
runtime.goexit()
        /home/ec2-user/go/src/runtime/asm_amd64.s:1670 +0x1 fp=0xc2088a6fe8 sp=0xc2088a6fe0
created by runtime.startGC
        /home/ec2-user/go/src/runtime/mgc.go:734 +0x14a
. . .

$ go version
go version devel +8017ace Sat May 23 17:42:43 2015 +0000 linux/amd64

@methane
Contributor Author

methane commented May 24, 2015

When I'm lucky (?), I can reproduce it in 1m30sec.

@RLH
Contributor

RLH commented May 24, 2015

https://go-review.googlesource.com/#/c/9373/

Are folks using the stress tool? I have found it makes me lucky...
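
Assuming the linked CL is the golang.org/x/tools/cmd/stress utility, a typical invocation looks like this; it reruns the given command over and over in parallel and keeps the output of failing runs:

$ go get golang.org/x/tools/cmd/stress
$ stress ./hello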


@methane
Contributor Author

methane commented May 24, 2015

@RLH Does it use multiple processes to run tests in parallel?
I think a higher GOMAXPROCS makes it easier to reproduce than multiple processes.

Here is new code to reproduce it. This doesn't require wrk.
https://gist.github.com/methane/a874718dea95cc031f35

I reproduced it once on a c4.4xlarge (16 cores) within an hour.
The crash dump is different from the previous code's.

fatal error: acquireSudog: found s.elem != nil in cache

goroutine 55 [running]:
runtime.throw(0x75d970, 0x2a)
    /Users/inada-n/local/go/src/runtime/panic.go:527 +0x96 fp=0xc208d8e5a8 sp=0xc208d8e590
runtime.acquireSudog(0xc208e03700)
    /Users/inada-n/local/go/src/runtime/proc.go:232 +0x332 fp=0xc208d8e640 sp=0xc208d8e5a8
runtime.selectgoImpl(0xc208d8e938, 0x0, 0x18)
    /Users/inada-n/local/go/src/runtime/select.go:369 +0x889 fp=0xc208d8e7e8 sp=0xc208d8e640
runtime.selectgo(0xc208d8e938)
    /Users/inada-n/local/go/src/runtime/select.go:212 +0x12 fp=0xc208d8e808 sp=0xc208d8e7e8
net/http.(*persistConn).roundTrip(0xc2080fe000, 0xc2085da420, 0x0, 0x0, 0x0)
    /Users/inada-n/local/go/src/net/http/transport.go:1165 +0x8e1 fp=0xc208d8ea40 sp=0xc208d8e808
net/http.(*Transport).RoundTrip(0xc20806a120, 0xc208100f70, 0xc207ff2546, 0x0, 0x0)
    /Users/inada-n/local/go/src/net/http/transport.go:235 +0x533 fp=0xc208d8eb60 sp=0xc208d8ea40
net/http.send(0xc208100f70, 0x7fdfb4a2b468, 0xc20806a120, 0x16, 0x0, 0x0)
    /Users/inada-n/local/go/src/net/http/client.go:220 +0x4e1 fp=0xc208d8ec58 sp=0xc208d8eb60
net/http.(*Client).send(0x87cd20, 0xc208100f70, 0x16, 0x0, 0x0)
    /Users/inada-n/local/go/src/net/http/client.go:143 +0x15d fp=0xc208d8ed28 sp=0xc208d8ec58
net/http.(*Client).doFollowingRedirects(0x87cd20, 0xc208100f70, 0x78d1d8, 0x0, 0x0, 0x0)
    /Users/inada-n/local/go/src/net/http/client.go:380 +0xbc3 fp=0xc208d8ef28 sp=0xc208d8ed28
net/http.(*Client).Get(0x87cd20, 0xc208010540, 0x16, 0xd, 0x0, 0x0)
    /Users/inada-n/local/go/src/net/http/client.go:306 +0xad fp=0xc208d8ef78 sp=0xc208d8ef28
main.client()
    /Users/inada-n/work/hellobench.go/hello/repro.go:43 +0x44 fp=0xc208d8efe0 sp=0xc208d8ef78
runtime.goexit()
    /Users/inada-n/local/go/src/runtime/asm_amd64.s:1670 +0x1 fp=0xc208d8efe8 sp=0xc208d8efe0
created by main.main
    /Users/inada-n/work/hellobench.go/hello/repro.go:34 +0x122

goroutine 1 [sleep]:
time.Sleep(0x3b9aca00)
    /Users/inada-n/local/go/src/runtime/time.go:59 +0xfc
main.main()
    /Users/inada-n/work/hellobench.go/hello/repro.go:37 +0x142

goroutine 5 [IO wait, 46 minutes]:
net.runtime_pollWait(0x7fdfb4a30590, 0x72, 0xc20800a1a0)
    /Users/inada-n/local/go/src/runtime/netpoll.go:157 +0x63
net.(*pollDesc).Wait(0xc208042140, 0x72, 0x0, 0x0)
...

It was cross-compiled on a Mac.

$ ~/local/go/bin/go version
go version devel +8017ace Sat May 23 17:42:43 2015 +0000 darwin/amd64
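
For readers without access to the gist, the shape of that reproducer can be inferred from the traceback above (main.main spawns client goroutines and sleeps; main.client loops on http.Get against an in-process server). A rough sketch under those assumptions, with the goroutine count and URL as placeholders:

package main

import (
	"fmt"
	"io"
	"io/ioutil"
	"net/http"
	"time"
)

func client() {
	// Hammer the in-process server; each Get goes through the
	// transport's select loop seen in the traceback above.
	for {
		resp, err := http.Get("http://127.0.0.1:8080/")
		if err != nil {
			continue
		}
		io.Copy(ioutil.Discard, resp.Body)
		resp.Body.Close()
	}
}

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "Hello, World!\n")
	})
	go http.ListenAndServe(":8080", nil)

	// Placeholder client count; the real gist may differ.
	for i := 0; i < 100; i++ {
		go client()
	}
	for {
		time.Sleep(time.Second)
	}
}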

@dspezia
Contributor

dspezia commented May 24, 2015

I have reproduced it with wrk on a physical box (32 HW threads, GOMAXPROCS=32), running SLES11SP2, in 15 minutes. The problem does not only occur on AWS EC2 instances.

runtime: g1256: leftover defer sp=0x0 pc=0xc208fb6800
        defer 0xc208510ec0 sp=0xc20a55c610 pc=0x4dfafe
        defer 0xc2096c4900 sp=0x0 pc=0xc208fb6800
fatal error: unexpected signal during runtime execution
[signal 0xb code=0x1 addr=0x15 pc=0x446634]

runtime stack:
runtime.throw(0x751dd0, 0x2a)
        /data2/ds/local/go/src/runtime/panic.go:527 +0x96
runtime.sigpanic()
        /data2/ds/local/go/src/runtime/sigpanic_unix.go:12 +0x5d
runtime.gentraceback(0x428fbc, 0xc20a55c538, 0x0, 0xc209680780, 0x0, 0x0, 0x7fffffff, 0x7f48ef559c78, 0x0, 0x0, ...)
        /data2/ds/local/go/src/runtime/traceback.go:448 +0x6d4
runtime.scanstack(0xc209680780)
        /data2/ds/local/go/src/runtime/mgcmark.go:319 +0x28b
runtime.gcphasework(0xc209680780)
        /data2/ds/local/go/src/runtime/mgcmark.go:270 +0x5a
runtime.stopg(0xc209680780, 0x1)
        /data2/ds/local/go/src/runtime/proc1.go:395 +0x91
runtime.markroot(0xc20801a060, 0x359)
        /data2/ds/local/go/src/runtime/mgcmark.go:138 +0x1f2
runtime.parfordo(0xc20801a060)
        /data2/ds/local/go/src/runtime/parfor.go:95 +0x12d
runtime.gcscan_m()
        /data2/ds/local/go/src/runtime/mgcmark.go:37 +0xe1
runtime.gc.func1()
        /data2/ds/local/go/src/runtime/mgc.go:829 +0x5f
runtime.systemstack(0xc208036000)
        /data2/ds/local/go/src/runtime/asm_amd64.s:262 +0x7c
runtime.mstart()
        /data2/ds/local/go/src/runtime/proc1.go:656

goroutine 82 [garbage collection scan]: 
runtime.systemstack_switch()
        /data2/ds/local/go/src/runtime/asm_amd64.s:216 fp=0xc2081d0590 sp=0xc2081d0588
runtime.gc(0x0)
        /data2/ds/local/go/src/runtime/mgc.go:841 +0x1dc fp=0xc2081d07b0 sp=0xc2081d0590
runtime.backgroundgc()
        /data2/ds/local/go/src/runtime/mgc.go:761 +0x40 fp=0xc2081d07e0 sp=0xc2081d07b0
runtime.goexit()
        /data2/ds/local/go/src/runtime/asm_amd64.s:1670 +0x1 fp=0xc2081d07e8 sp=0xc2081d07e0
created by runtime.startGC
        /data2/ds/local/go/src/runtime/mgc.go:734 +0x14a

$ go version
go version devel +8017ace Sat May 23 17:42:43 2015 +0000 linux/amd64

@methane
Contributor Author

methane commented May 25, 2015

I reproduced it on an 8-core machine.
It took 272min 43sec.

fatal error: unexpected signal during runtime execution
[signal 0xb code=0x1 addr=0x53 pc=0x417c0e]

goroutine 519 [running]:
runtime.throw(0x765d70, 0x2a)
        /Users/inada-n/local/go/src/runtime/panic.go:527 +0x96 fp=0xc20801f530 sp=0xc20801f518
runtime.sigpanic()
        /Users/inada-n/local/go/src/runtime/sigpanic_unix.go:12 +0x5d fp=0xc20801f580 sp=0xc20801f530
runtime.clearpools()
        /Users/inada-n/local/go/src/runtime/mgc.go:1439 +0xce fp=0xc20801f590 sp=0xc20801f580
runtime.gc(0x0)
        /Users/inada-n/local/go/src/runtime/mgc.go:812 +0x165 fp=0xc20801f7b0 sp=0xc20801f590
runtime.backgroundgc()
        /Users/inada-n/local/go/src/runtime/mgc.go:761 +0x40 fp=0xc20801f7e0 sp=0xc20801f7b0
runtime.goexit()
        /Users/inada-n/local/go/src/runtime/asm_amd64.s:1670 +0x1 fp=0xc20801f7e8 sp=0xc20801f7e0
created by runtime.startGC
        /Users/inada-n/local/go/src/runtime/mgc.go:734 +0x14a

@methane changed the title from "runtime: segmentation fault in malloc" to "runtime: random fatal errors in runtime" on May 25, 2015
@methane
Contributor Author

methane commented Jun 4, 2015

Shall I try to capture a core dump?

@gopherbot
Contributor

CL https://golang.org/cl/10713 mentions this issue.

@methane
Contributor Author

methane commented Jun 5, 2015

@gopherbot
Contributor

CL https://golang.org/cl/10791 mentions this issue.

@aclements
Member

@methane, thanks for trying out CL 10713. Unfortunately, your reproduction didn't hit the path that CL 10713 adds debugging to. It looks like your original traceback would have, though. If you don't mind, could you try reproducing it a few more times to see if you can get the "free list corrupted" panic from CL 10713?

Even better, if you can reproduce it reliably enough, it would be great if you could try CL 10791.
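
For anyone else who wants to test with one of these CLs applied to a Go source checkout, Gerrit's cherry-pick download takes roughly this form (the <patchset> ref is whatever the CL's Download menu shows; ~/go-tip is a placeholder for the checkout path):

$ cd ~/go-tip
$ git fetch https://go.googlesource.com/go refs/changes/13/10713/<patchset>
$ git cherry-pick FETCH_HEAD
$ cd src && ./make.bash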

methane pushed a commit to methane/go that referenced this issue Jun 6, 2015
Issues golang#10240, golang#10541, golang#10941, golang#11023, golang#11027 and possibly others are
indicating memory corruption in the runtime. One of the easiest places
to both get corruption and detect it is in the allocator's free lists
since they appear throughout memory and follow strict invariants. This
commit adds a check when sweeping a span that its free list is sane
and, if not, it prints the corrupted free list and panics. Hopefully
this will help us collect more information on these failures.

Change-Id: I6d417bcaeedf654943a5e068bd76b58bb02d4a64
@methane
Contributor Author

methane commented Jun 6, 2015

@aclements OK, I've merged the two CLs (methane@3566a7d)
and reproduced the problem with them.
https://gist.github.com/methane/56cc12a4030b493df82b#file-1-x2-txt

aclements added a commit that referenced this issue Jun 7, 2015
Stack barriers assume that writes through pointers to frames above the
current frame will get write barriers, and hence these frames do not
need to be re-scanned to pick up these changes. For normal writes,
this is true. However, there are places in the runtime that use
typedmemmove to potentially write through pointers to higher frames
(such as mapassign1). Currently, typedmemmove does not execute write
barriers if the destination is on the stack. If there's a stack
barrier between the current frame and the frame being modified with
typedmemmove, and the stack barrier is not otherwise hit, it's
possible that the garbage collector will never see the updated pointer
and incorrectly reclaim the object.

Fix this by making heapBitsBulkBarrier (which lies behind typedmemmove
and its variants) detect when the destination is in the stack and
unwind stack barriers up to the point, forcing mark termination to
later rescan the affected frame and collect these pointers.

Fixes #11084. Might be related to #10240, #10541, #10941, #11023,
#11027 and possibly others.

Change-Id: I323d6cd0f1d29fa01f8fc946f4b90e04ef210efd
Reviewed-on: https://go-review.googlesource.com/10791
Reviewed-by: Russ Cox <[email protected]>
@dspezia
Contributor

dspezia commented Jun 8, 2015

Following the sequence of CLs 10795, 10791, 10794, 10801 by @aclements, I have tried to reproduce this issue on a physical box. I can actually reproduce it with or without the CLs applied.

Without the CLs:
https://gist.github.com/dspezia/d7e44dd0f60a7d8aefc3#file-test1
https://gist.github.com/dspezia/d7e44dd0f60a7d8aefc3#file-test2

With the CLs:
https://gist.github.com/dspezia/d7e44dd0f60a7d8aefc3#file-test3
https://gist.github.com/dspezia/d7e44dd0f60a7d8aefc3#file-test4
https://gist.github.com/dspezia/d7e44dd0f60a7d8aefc3#file-test5

So they do not seem to help with this specific issue.

@eastany

eastany commented Jun 9, 2015

I hit the same thing.
But when I build my project with go1.2, it works fine.

runtime: g2143: leftover defer sp=0xdeaddeaddeaddead pc=0x48cc36
    defer 0xc20885ea80 sp=0xdeaddeaddeaddead pc=0x48cc36
fatal error: traceback has leftover defers

runtime stack:
runtime.throw(0x99add0, 0x1d)
    /home/work/go/src/runtime/panic.go:527 +0x96
runtime.gentraceback(0x42a799, 0xc208fea6c0, 0x0, 0xc208481400, 0x0, 0x0, 0x7fffffff, 0x7fa93cfb5c58, 0x0, 0x0, ...)
    /home/work/go/src/runtime/traceback.go:450 +0x7b0
runtime.scanstack(0xc208481400)
    /home/work/go/src/runtime/mgcmark.go:319 +0x28b
runtime.gcphasework(0xc208481400)
    /home/work/go/src/runtime/mgcmark.go:270 +0x7f
runtime.stopg(0xc208481400, 0x4)
    /home/work/go/src/runtime/proc1.go:395 +0x91
runtime.markroot(0xc20801a000, 0x3f)
    /home/work/go/src/runtime/mgcmark.go:138 +0x1f2
runtime.parfordo(0xc20801a000)
    /home/work/go/src/runtime/parfor.go:95 +0x12d
runtime.gcscan_m()
    /home/work/go/src/runtime/mgcmark.go:37 +0xde
runtime.gc.func1()
    /home/work/go/src/runtime/mgc.go:830 +0x5f
runtime.systemstack(0xc20801c000)
    /home/work/go/src/runtime/asm_amd64.s:262 +0x79
runtime.mstart()
    /home/work/go/src/runtime/proc1.go:719
go version devel +7eec656 Wed May 20 07:54:37 2015 +0000 linux/amd64

@methane
Contributor Author

methane commented Jun 11, 2015

I see some GC-related fixes have been committed.
I'm running my code with a788c91.
It has run for 30 minutes without a crash. I'll post an update in about 12 hours.

@methane
Contributor Author

methane commented Jun 11, 2015

It crashed. :-/

stkbarPos=0 len(stkbar)=1 goid=967 gcphase=2
fatal error: g already has stack barriers

runtime stack:
runtime.throw(0x751e10, 0x1c)
        /Users/inada-n/local/go/src/runtime/panic.go:527 +0x96
runtime.scanstack(0xc2088c6000)
        /Users/inada-n/local/go/src/runtime/mgcmark.go:334 +0x4f4
runtime.gcphasework(0xc2088c6000)
        /Users/inada-n/local/go/src/runtime/mgcmark.go:273 +0x6d
runtime.stopg(0xc2088c6000, 0x4)
        /Users/inada-n/local/go/src/runtime/proc1.go:395 +0x8e
runtime.markroot(0xc208018000, 0xc2000002e8)
        /Users/inada-n/local/go/src/runtime/mgcmark.go:138 +0x1ef
runtime.parfordo(0xc208018000)
        /Users/inada-n/local/go/src/runtime/parfor.go:95 +0x12a
runtime.gcscan_m()
        /Users/inada-n/local/go/src/runtime/mgcmark.go:37 +0xde
runtime.gc.func1()
        /Users/inada-n/local/go/src/runtime/mgc.go:876 +0x5c
runtime.systemstack(0xc20801a000)
        /Users/inada-n/local/go/src/runtime/asm_amd64.s:262 +0x7c
runtime.mstart()
        /Users/inada-n/local/go/src/runtime/proc1.go:656

goroutine 864 [garbage collection scan]:
runtime.systemstack_switch()
        /Users/inada-n/local/go/src/runtime/asm_amd64.s:216 fp=0xc2080d6ce8 sp=0xc2080d6ce0
runtime.gc(0x0)
        /Users/inada-n/local/go/src/runtime/mgc.go:887 +0x1d9 fp=0xc2080d6f90 sp=0xc2080d6ce8
runtime.backgroundgc()
        /Users/inada-n/local/go/src/runtime/mgc.go:783 +0x3d fp=0xc2080d6fc0 sp=0xc2080d6f90
runtime.goexit()
        /Users/inada-n/local/go/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc2080d6fc8 sp=0xc2080d6fc0
created by runtime.startGC
        /Users/inada-n/local/go/src/runtime/mgc.go:756 +0x147

@aclements
Member

Thanks for running it again, @methane. This actually looks like good progress. Your crash is definitely interesting and I'll need to dig in to it, but it's almost certainly not memory corruption like your other crashes have been. If it's easy, can you continue stress testing and see if you get any other types of crashes?

@methane
Contributor Author

methane commented Jun 14, 2015

I've patched throw() like this to see the backtrace in gdb:

--- a/src/runtime/panic.go
+++ b/src/runtime/panic.go
@@ -519,11 +519,6 @@ func dopanic(unused int) {
 //go:nosplit
 func throw(s string) {
        print("fatal error: ", s, "\n")
-       gp := getg()
-       if gp.m.throwing == 0 {
-               gp.m.throwing = 1
-       }
-       startpanic()
-       dopanic(0)
+       crash()
        *(*int)(nil) = 0 // not reached
 }

And reproduced it (g already has stack barriers).

(gdb) p/x *gp
$6 = {stack = {lo = 0xc208d75000, hi = 0xc208d75fd0}, stackguard0 = 0xc208d75280, stackguard1 = 0xffffffffffffffff, _panic = 0x0, _defer = 0x0, stackAlloc = 0x1000, sched = {
    sp = 0xc208d75c50, pc = 0x42a4d9, g = 0xc2081f2580, ctxt = 0x0, ret = 0x0, lr = 0x0, bp = 0x2}, syscallsp = 0x0, syscallpc = 0x58cf75, stkbar = {array = 0xc208d75fd0,
    len = 0x1, cap = 0x3}, stkbarPos = 0x1, param = 0x0, atomicstatus = 0x1004, goid = 0x106, waitsince = 0x4a3fb6b9787544, waitreason = 0x70e7f0 "select",
  schedlink = 0xc2082dd600, preempt = 0x0, paniconfault = 0x0, preemptscan = 0x0, gcworkdone = 0x0, gcscanvalid = 0x0, throwsplit = 0x0, raceignore = 0x0, m = 0x0, lockedm = 0x0,
  sig = 0x0, writebuf = {array = 0x0, len = 0x0, cap = 0x0}, sigcode0 = 0x0, sigcode1 = 0x0, sigpc = 0x0, gopc = 0x480470, startpc = 0x483420, racectx = 0x0,
  waiting = 0xc20908e640, readyg = 0x0, gcalloc = 0x0, gcscanwork = 0x0}
(gdb) bt
#0  runtime.raise () at /Users/inada-n/local/go/src/runtime/sys_linux_amd64.s:104
#1  0x000000000043a977 in runtime.crash () at /Users/inada-n/local/go/src/runtime/signal1_unix.go:158
#2  0x0000000000427f8d in runtime.throw (s=...) at /Users/inada-n/local/go/src/runtime/panic.go:522
#3  0x00000000004192f4 in runtime.scanstack (gp=0xc2081f2580) at /Users/inada-n/local/go/src/runtime/mgcmark.go:334
#4  0x0000000000418dad in runtime.gcphasework (gp=0xc2081f2580) at /Users/inada-n/local/go/src/runtime/mgcmark.go:273
#5  0x000000000042c41e in runtime.stopg (gp=0xc2081f2580, ~r1=4) at /Users/inada-n/local/go/src/runtime/proc1.go:395
#6  0x000000000041883f in runtime.markroot (desc=0xc208018000, i=198) at /Users/inada-n/local/go/src/runtime/mgcmark.go:138
#7  0x0000000000428aea in runtime.parfordo (desc=0xc208018000) at /Users/inada-n/local/go/src/runtime/parfor.go:95
#8  0x000000000041858e in runtime.gcscan_m () at /Users/inada-n/local/go/src/runtime/mgcmark.go:37
#9  0x000000000044b70c in runtime.gc.func1 () at /Users/inada-n/local/go/src/runtime/mgc.go:876
#10 0x0000000000455c6c in runtime.systemstack () at /Users/inada-n/local/go/src/runtime/asm_amd64.s:262
#11 0x000000000042cd50 in runtime.startTheWorldWithSema () at /Users/inada-n/local/go/src/runtime/proc1.go:591
#12 0x000000000053e751 in net/textproto.(*Reader).readLineSlice (r=0x0, ~r0=..., ~r1=...) at /Users/inada-n/local/go/src/net/textproto/reader.go:55
#13 0x0000000000000000 in ?? ()

It seems the gp is not corrupted.
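
Incidentally, a roughly similar result can usually be obtained without patching the runtime: GOTRACEBACK=crash makes the fatal-error path end in crash() (raising SIGABRT) after printing the usual traceback, so the kernel writes a core file. The patch above additionally skips startpanic/dopanic, which leaves the crashing g a bit less disturbed. A sketch of the unpatched workflow:

$ ulimit -c unlimited
$ GOTRACEBACK=crash ./hello
...
$ gdb ./hello core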

@methane
Contributor Author

methane commented Jun 14, 2015

@aclements It seems it's harder to make it crash than before; it takes several hours.
I've seen the crash only twice, and both crashes were "g already has stack barriers".

I've started stress testing on a temporarily idle server at my company.
I'll use GCP Compute Engine or AWS EC2 next week.

@methane
Contributor Author

methane commented Jun 14, 2015

I reproduced the crash several times. All of them were "g already has stack barriers".

grep stkbar shows:

x.txt:stkbarPos=0 len(stkbar)=1 goid=2105 gcphase=2
x.txt:stkbarPos=1 len(stkbar)=1 goid=169 gcphase=2
x.txt:stkbarPos=1 len(stkbar)=2 goid=318 gcphase=2
x.txt:stkbarPos=0 len(stkbar)=1 goid=1913 gcphase=2
x.txt:stkbarPos=1 len(stkbar)=1 goid=731 gcphase=2
x.txt:stkbarPos=0 len(stkbar)=1 goid=1299 gcphase=2
x.txt:stkbarPos=0 len(stkbar)=2 goid=176 gcphase=2
x.txt:stkbarPos=2 len(stkbar)=2 goid=551 gcphase=2
x.txt:stkbarPos=1 len(stkbar)=1 goid=282 gcphase=2
x.txt:stkbarPos=1 len(stkbar)=2 goid=638 gcphase=2
x.txt:stkbarPos=1 len(stkbar)=1 goid=1426 gcphase=2
stress.0:stkbarPos=1 len(stkbar)=1 goid=459 gcphase=2
stress.1:stkbarPos=0 len(stkbar)=1 goid=1052 gcphase=2
stress.2:stkbarPos=0 len(stkbar)=1 goid=304 gcphase=2
stress.3:stkbarPos=0 len(stkbar)=1 goid=745 gcphase=2
stress.4:stkbarPos=1 len(stkbar)=1 goid=1465 gcphase=2
x.txt:stkbarPos=1 len(stkbar)=2 goid=829 gcphase=2
x.txt:stkbarPos=1 len(stkbar)=1 goid=453 gcphase=2
x.txt:stkbarPos=1 len(stkbar)=1 goid=222 gcphase=2
x.txt:stkbarPos=0 len(stkbar)=1 goid=21 gcphase=2
x.txt:stkbarPos=0 len(stkbar)=1 goid=336 gcphase=2
stress.0:stkbarPos=1 len(stkbar)=1 goid=419 gcphase=2

@methane
Contributor Author

methane commented Jun 14, 2015

It seems most of the bugs causing runtime fatal errors have been fixed, and one bug remains in the concurrent GC?

@dspezia
Contributor

dspezia commented Jun 14, 2015

Same here: on 10 runs, I can now only reproduce the "g already has stack barriers" error.

hello.log:stkbarPos=0 len(stkbar)=1 goid=1025 gcphase=2
hello.log2:stkbarPos=0 len(stkbar)=1 goid=1018 gcphase=2
hello.log3:stkbarPos=0 len(stkbar)=1 goid=1133 gcphase=2
hello.log4:stkbarPos=0 len(stkbar)=1 goid=985 gcphase=2
hello.log5:stkbarPos=0 len(stkbar)=1 goid=950 gcphase=2
hello.log6:stkbarPos=1 len(stkbar)=1 goid=1046 gcphase=2
hello.log7:stkbarPos=2 len(stkbar)=2 goid=272 gcphase=2
hello.log8:stkbarPos=1 len(stkbar)=2 goid=417 gcphase=2
hello.log9:stkbarPos=1 len(stkbar)=1 goid=301 gcphase=2
hello.log10:stkbarPos=0 len(stkbar)=1 goid=1039 gcphase=2

I got 7 failures in the GC, like https://gist.github.com/dspezia/8ebae9a48007de7028c6#file-gc1

I got 3 failures while growing stacks:
https://gist.github.com/dspezia/8ebae9a48007de7028c6#file-stack1
https://gist.github.com/dspezia/8ebae9a48007de7028c6#file-stack2
https://gist.github.com/dspezia/8ebae9a48007de7028c6#file-stack3

go version devel +6b24da6 Sun Jun 14 01:52:54 2015 +0000 linux/amd64

@aclements
Member

Thanks @methane and @dspezia. It's clear that we have a race between the two places that trigger stack barrier insertion (which map exactly to the two classes of tracebacks you saw, @dspezia). I think at this point we've solved the original memory corruption issues. There's still one other related issue involving channels that has an outstanding CL, but it seems to be uninvolved in these tests. I'll dig in to the race when installing stack barriers.

If you're still stress testing, note that it's possible for this race to manifest as other failures. If we fail to catch the race at the "g already has stack barriers" point, we could get an index out of bounds in the runtime. I think it's also possible, though unlikely, for this race to cause the runtime to re-scan too little of the stack, which could cause a missed mark, which could manifest as memory corruption.

aclements added a commit that referenced this issue Jun 16, 2015
Issues #10240, #10541, #10941, #11023, #11027 and possibly others are
indicating memory corruption in the runtime. One of the easiest places
to both get corruption and detect it is in the allocator's free lists
since they appear throughout memory and follow strict invariants. This
commit adds a check when sweeping a span that its free list is sane
and, if not, it prints the corrupted free list and panics. Hopefully
this will help us collect more information on these failures.

Change-Id: I6d417bcaeedf654943a5e068bd76b58bb02d4a64
Reviewed-on: https://go-review.googlesource.com/10713
Reviewed-by: Keith Randall <[email protected]>
Reviewed-by: Russ Cox <[email protected]>
Run-TryBot: Austin Clements <[email protected]>
@methane
Contributor Author

methane commented Jun 17, 2015

CL 11089 may fix this.
I've started stress testing with the CL applied.

@methane
Contributor Author

methane commented Jun 17, 2015

I cannot reproduce the crash anymore.
I'll close this issue when CL 11089 is merged.
Good job, Go developers!

@gopherbot
Contributor

CL https://golang.org/cl/11089 mentions this issue.
