
runtime: random fatal errors in runtime #10941


Closed
methane opened this issue May 23, 2015 · 32 comments

@methane
Contributor

methane commented May 23, 2015

Version: tip 8017ace
Program: https://github.com/methane/hellobench.go/blob/master/hello.go
Stack dump: https://gist.github.com/methane/9701f5c7f58e2d701b65
To reproduce:

I've encountered this problem twice, but I forgot to save a core file.

Is this a known bug?

@minux changed the title from "Segmentation fault in malloc" to "runtime: segmentation fault in malloc" on May 24, 2015
@minux added this to the Go1.5 milestone on May 24, 2015
@methane
Contributor Author

methane commented May 24, 2015

I can reproduce it with simpler code: https://gist.github.com/methane/7f0e381b6bde87dd8469#file-hello-go

$ ~/local/go-dev/bin/go version
go version devel +8017ace Sat May 23 17:42:43 2015 +0000 linux/amd64
$ ~/local/go-dev/bin/go build hello.go
$ GOMAXPROCS=16 ./hello
$ wrk -t4 -c800 http://127.0.0.1:8080/ -d10000

OS is Linux amd64
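
For reference, the reproducer is essentially a plain net/http "hello" server; a minimal sketch along these lines should be equivalent (the gist above is authoritative; the handler body and the hard-coded :8080 port are assumptions taken from the wrk command):

package main

import (
	"fmt"
	"net/http"
)

func main() {
	// Tiny response body so the benchmark is dominated by scheduler/GC
	// work rather than request processing.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "Hello, World!\n")
	})
	// The wrk command above targets http://127.0.0.1:8080/.
	http.ListenAndServe(":8080", nil)
}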

@methane
Contributor Author

methane commented May 24, 2015

But this time I got a different crash message:

runtime: g678: leftover defer sp=0x0 pc=0xc208948800
    defer 0xc208c77140 sp=0x0 pc=0xc208948800
fatal error: traceback has leftover defers

https://gist.github.com/methane/7f0e381b6bde87dd8469#file-crashdump

@davecheney
Contributor

Thanks for your report. I have not been able to reproduce either of these crashes yet.

% go version
go version devel +001438b Thu May 21 19:14:03 2015 +0000 linux/amd64

Are you able to reproduce the crash outside the EC2 environment?

@methane
Contributor Author

methane commented May 24, 2015

I think more concurrency makes it easier to reproduce.
I haven't reproduced it with GOMAXPROCS=4.
But I don't have a many-core machine; I always use EC2 when I need the extra horsepower.

@minux
Member

minux commented May 24, 2015 via email

@dspezia
Contributor

dspezia commented May 24, 2015

I cannot reproduce on small VMs. Will try later on larger ones.
How long do you have to let it run to produce a crash?

$ go version
go version devel +eeb64b7 Sat May 23 05:22:57 2015 +0000 linux/amd64

@dspezia
Contributor

dspezia commented May 24, 2015

I reproduced it on an EC2 instance with 36 vCPUs (and GOMAXPROCS=36), after 10 min of traffic injection at more than 400K r/s (about 200M requests).

runtime: g1645: leftover defer sp=0xdeaddeaddeaddead pc=0x4dfafe
        defer 0xc208257bc0 sp=0xdeaddeaddeaddead pc=0x4dfafe
        defer 0xc2080eeaa0 sp=0xc20a43bc80 pc=0x469ea6
fatal error: traceback has leftover defers

runtime stack:
runtime.throw(0x73c8d0, 0x1d)
        /home/ec2-user/go/src/runtime/panic.go:527 +0x96
runtime.gentraceback(0x428fbc, 0xc20a43b4b0, 0x0, 0xc20a53a280, 0x0, 0x0, 0x7fffffff, 0x7fae5cff8cb8, 0x0, 0x0, ...)
        /home/ec2-user/go/src/runtime/traceback.go:450 +0x7b9
runtime.scanstack(0xc20a53a280)
        /home/ec2-user/go/src/runtime/mgcmark.go:319 +0x28b
runtime.gcphasework(0xc20a53a280)
        /home/ec2-user/go/src/runtime/mgcmark.go:270 +0x5a
runtime.stopg(0xc20a53a280, 0x4)
        /home/ec2-user/go/src/runtime/proc1.go:395 +0x91
runtime.markroot(0xc20801a000, 0x4b6)
        /home/ec2-user/go/src/runtime/mgcmark.go:138 +0x1f2
runtime.parfordo(0xc20801a000)
        /home/ec2-user/go/src/runtime/parfor.go:95 +0x12d
runtime.gcscan_m()
        /home/ec2-user/go/src/runtime/mgcmark.go:37 +0xe1
runtime.gc.func1()
        /home/ec2-user/go/src/runtime/mgc.go:829 +0x5f
runtime.systemstack(0xc20802c000)
        /home/ec2-user/go/src/runtime/asm_amd64.s:262 +0x7c
runtime.mstart()
        /home/ec2-user/go/src/runtime/proc1.go:656

goroutine 194 [garbage collection scan]:
runtime.systemstack_switch()
        /home/ec2-user/go/src/runtime/asm_amd64.s:216 fp=0xc2088a6d90 sp=0xc2088a6d88
runtime.gc(0x0)
        /home/ec2-user/go/src/runtime/mgc.go:841 +0x1dc fp=0xc2088a6fb0 sp=0xc2088a6d90
runtime.backgroundgc()
        /home/ec2-user/go/src/runtime/mgc.go:761 +0x40 fp=0xc2088a6fe0 sp=0xc2088a6fb0
runtime.goexit()
        /home/ec2-user/go/src/runtime/asm_amd64.s:1670 +0x1 fp=0xc2088a6fe8 sp=0xc2088a6fe0
created by runtime.startGC
        /home/ec2-user/go/src/runtime/mgc.go:734 +0x14a
. . .

$ go version
go version devel +8017ace Sat May 23 17:42:43 2015 +0000 linux/amd64

@methane
Contributor Author

methane commented May 24, 2015

When I'm lucky (?), I can reproduce it in 1m30sec.

@RLH
Contributor

RLH commented May 24, 2015

https://go-review.googlesource.com/#/c/9373/

Are folks using the stress tool? I have found it makes me lucky...
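
Assuming the linked CL is the golang.org/x/tools/cmd/stress utility, a typical invocation looks like this; it reruns the given command over and over in parallel and keeps the output of failing runs:

$ go get golang.org/x/tools/cmd/stress
$ stress ./hello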


@methane
Contributor Author

methane commented May 24, 2015

@RLH Does it use multiple processes to run tests in parallel?
I think a higher GOMAXPROCS makes it easier to reproduce than multiple processes.

Here is new code to reproduce it. This doesn't require wrk.
https://gist.github.com/methane/a874718dea95cc031f35

I reproduced it once on a c4.4xlarge (16 cores) within an hour.
The crash dump is different from the previous code's.

fatal error: acquireSudog: found s.elem != nil in cache

goroutine 55 [running]:
runtime.throw(0x75d970, 0x2a)
    /Users/inada-n/local/go/src/runtime/panic.go:527 +0x96 fp=0xc208d8e5a8 sp=0xc208d8e590
runtime.acquireSudog(0xc208e03700)
    /Users/inada-n/local/go/src/runtime/proc.go:232 +0x332 fp=0xc208d8e640 sp=0xc208d8e5a8
runtime.selectgoImpl(0xc208d8e938, 0x0, 0x18)
    /Users/inada-n/local/go/src/runtime/select.go:369 +0x889 fp=0xc208d8e7e8 sp=0xc208d8e640
runtime.selectgo(0xc208d8e938)
    /Users/inada-n/local/go/src/runtime/select.go:212 +0x12 fp=0xc208d8e808 sp=0xc208d8e7e8
net/http.(*persistConn).roundTrip(0xc2080fe000, 0xc2085da420, 0x0, 0x0, 0x0)
    /Users/inada-n/local/go/src/net/http/transport.go:1165 +0x8e1 fp=0xc208d8ea40 sp=0xc208d8e808
net/http.(*Transport).RoundTrip(0xc20806a120, 0xc208100f70, 0xc207ff2546, 0x0, 0x0)
    /Users/inada-n/local/go/src/net/http/transport.go:235 +0x533 fp=0xc208d8eb60 sp=0xc208d8ea40
net/http.send(0xc208100f70, 0x7fdfb4a2b468, 0xc20806a120, 0x16, 0x0, 0x0)
    /Users/inada-n/local/go/src/net/http/client.go:220 +0x4e1 fp=0xc208d8ec58 sp=0xc208d8eb60
net/http.(*Client).send(0x87cd20, 0xc208100f70, 0x16, 0x0, 0x0)
    /Users/inada-n/local/go/src/net/http/client.go:143 +0x15d fp=0xc208d8ed28 sp=0xc208d8ec58
net/http.(*Client).doFollowingRedirects(0x87cd20, 0xc208100f70, 0x78d1d8, 0x0, 0x0, 0x0)
    /Users/inada-n/local/go/src/net/http/client.go:380 +0xbc3 fp=0xc208d8ef28 sp=0xc208d8ed28
net/http.(*Client).Get(0x87cd20, 0xc208010540, 0x16, 0xd, 0x0, 0x0)
    /Users/inada-n/local/go/src/net/http/client.go:306 +0xad fp=0xc208d8ef78 sp=0xc208d8ef28
main.client()
    /Users/inada-n/work/hellobench.go/hello/repro.go:43 +0x44 fp=0xc208d8efe0 sp=0xc208d8ef78
runtime.goexit()
    /Users/inada-n/local/go/src/runtime/asm_amd64.s:1670 +0x1 fp=0xc208d8efe8 sp=0xc208d8efe0
created by main.main
    /Users/inada-n/work/hellobench.go/hello/repro.go:34 +0x122

goroutine 1 [sleep]:
time.Sleep(0x3b9aca00)
    /Users/inada-n/local/go/src/runtime/time.go:59 +0xfc
main.main()
    /Users/inada-n/work/hellobench.go/hello/repro.go:37 +0x142

goroutine 5 [IO wait, 46 minutes]:
net.runtime_pollWait(0x7fdfb4a30590, 0x72, 0xc20800a1a0)
    /Users/inada-n/local/go/src/runtime/netpoll.go:157 +0x63
net.(*pollDesc).Wait(0xc208042140, 0x72, 0x0, 0x0)
...

It was cross-compiled on a Mac.

$ ~/local/go/bin/go version
go version devel +8017ace Sat May 23 17:42:43 2015 +0000 darwin/amd64
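
For readers without access to the gist, the shape of that reproducer can be inferred from the traceback above (main.main spawns client goroutines and sleeps; main.client loops on http.Get against an in-process server). A rough sketch under those assumptions, with the goroutine count and URL as placeholders:

package main

import (
	"fmt"
	"io"
	"io/ioutil"
	"net/http"
	"time"
)

func client() {
	// Hammer the in-process server; each Get goes through the
	// transport's select loop seen in the traceback above.
	for {
		resp, err := http.Get("http://127.0.0.1:8080/")
		if err != nil {
			continue
		}
		io.Copy(ioutil.Discard, resp.Body)
		resp.Body.Close()
	}
}

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "Hello, World!\n")
	})
	go http.ListenAndServe(":8080", nil)

	// Placeholder client count; the real gist may differ.
	for i := 0; i < 100; i++ {
		go client()
	}
	for {
		time.Sleep(time.Second)
	}
}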

@dspezia
Contributor

dspezia commented May 24, 2015

I have reproduced it with wrk on a physical box (32 HW threads, GOMAXPROCS=32), running SLES11SP2, in 15 minutes. The problem does not only occur on AWS EC2 instances.

runtime: g1256: leftover defer sp=0x0 pc=0xc208fb6800
        defer 0xc208510ec0 sp=0xc20a55c610 pc=0x4dfafe
        defer 0xc2096c4900 sp=0x0 pc=0xc208fb6800
fatal error: unexpected signal during runtime execution
[signal 0xb code=0x1 addr=0x15 pc=0x446634]

runtime stack:
runtime.throw(0x751dd0, 0x2a)
        /data2/ds/local/go/src/runtime/panic.go:527 +0x96
runtime.sigpanic()
        /data2/ds/local/go/src/runtime/sigpanic_unix.go:12 +0x5d
runtime.gentraceback(0x428fbc, 0xc20a55c538, 0x0, 0xc209680780, 0x0, 0x0, 0x7fffffff, 0x7f48ef559c78, 0x0, 0x0, ...)
        /data2/ds/local/go/src/runtime/traceback.go:448 +0x6d4
runtime.scanstack(0xc209680780)
        /data2/ds/local/go/src/runtime/mgcmark.go:319 +0x28b
runtime.gcphasework(0xc209680780)
        /data2/ds/local/go/src/runtime/mgcmark.go:270 +0x5a
runtime.stopg(0xc209680780, 0x1)
        /data2/ds/local/go/src/runtime/proc1.go:395 +0x91
runtime.markroot(0xc20801a060, 0x359)
        /data2/ds/local/go/src/runtime/mgcmark.go:138 +0x1f2
runtime.parfordo(0xc20801a060)
        /data2/ds/local/go/src/runtime/parfor.go:95 +0x12d
runtime.gcscan_m()
        /data2/ds/local/go/src/runtime/mgcmark.go:37 +0xe1
runtime.gc.func1()
        /data2/ds/local/go/src/runtime/mgc.go:829 +0x5f
runtime.systemstack(0xc208036000)
        /data2/ds/local/go/src/runtime/asm_amd64.s:262 +0x7c
runtime.mstart()
        /data2/ds/local/go/src/runtime/proc1.go:656

goroutine 82 [garbage collection scan]: 
runtime.systemstack_switch()
        /data2/ds/local/go/src/runtime/asm_amd64.s:216 fp=0xc2081d0590 sp=0xc2081d0588
runtime.gc(0x0)
        /data2/ds/local/go/src/runtime/mgc.go:841 +0x1dc fp=0xc2081d07b0 sp=0xc2081d0590
runtime.backgroundgc()
        /data2/ds/local/go/src/runtime/mgc.go:761 +0x40 fp=0xc2081d07e0 sp=0xc2081d07b0
runtime.goexit()
        /data2/ds/local/go/src/runtime/asm_amd64.s:1670 +0x1 fp=0xc2081d07e8 sp=0xc2081d07e0
created by runtime.startGC
        /data2/ds/local/go/src/runtime/mgc.go:734 +0x14a

$ go version
go version devel +8017ace Sat May 23 17:42:43 2015 +0000 linux/amd64

@methane
Contributor Author

methane commented May 25, 2015

I reproduced it on an 8-core machine.
It took 272min 43sec.

fatal error: unexpected signal during runtime execution
[signal 0xb code=0x1 addr=0x53 pc=0x417c0e]

goroutine 519 [running]:
runtime.throw(0x765d70, 0x2a)
        /Users/inada-n/local/go/src/runtime/panic.go:527 +0x96 fp=0xc20801f530 sp=0xc20801f518
runtime.sigpanic()
        /Users/inada-n/local/go/src/runtime/sigpanic_unix.go:12 +0x5d fp=0xc20801f580 sp=0xc20801f530
runtime.clearpools()
        /Users/inada-n/local/go/src/runtime/mgc.go:1439 +0xce fp=0xc20801f590 sp=0xc20801f580
runtime.gc(0x0)
        /Users/inada-n/local/go/src/runtime/mgc.go:812 +0x165 fp=0xc20801f7b0 sp=0xc20801f590
runtime.backgroundgc()
        /Users/inada-n/local/go/src/runtime/mgc.go:761 +0x40 fp=0xc20801f7e0 sp=0xc20801f7b0
runtime.goexit()
        /Users/inada-n/local/go/src/runtime/asm_amd64.s:1670 +0x1 fp=0xc20801f7e8 sp=0xc20801f7e0
created by runtime.startGC
        /Users/inada-n/local/go/src/runtime/mgc.go:734 +0x14a

@methane changed the title from "runtime: segmentation fault in malloc" to "runtime: random fatal errors in runtime" on May 25, 2015
@methane
Contributor Author

methane commented Jun 4, 2015

Shall I try to capture a core dump?

@gopherbot
Contributor

CL https://golang.org/cl/10713 mentions this issue.

@methane
Contributor Author

methane commented Jun 5, 2015

@gopherbot
Contributor

CL https://golang.org/cl/10791 mentions this issue.

@aclements
Member

@methane, thanks for trying out CL 10713. Unfortunately, your reproduction didn't hit the path that CL 10713 adds debugging to. It looks like your original traceback would have, though. If you don't mind, could you try reproducing it a few more times to see if you can get the "free list corrupted" panic from CL 10713?

Even better, if you can reproduce it reliably enough, it would be great if you could try CL 10791.
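
For anyone else who wants to test with one of these CLs applied to a Go source checkout, Gerrit's cherry-pick download takes roughly this form (the <patchset> ref is whatever the CL's Download menu shows; ~/go-tip is a placeholder for the checkout path):

$ cd ~/go-tip
$ git fetch https://go.googlesource.com/go refs/changes/13/10713/<patchset>
$ git cherry-pick FETCH_HEAD
$ cd src && ./make.bash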

methane pushed a commit to methane/go that referenced this issue Jun 6, 2015
Issues golang#10240, golang#10541, golang#10941, golang#11023, golang#11027 and possibly others are
indicating memory corruption in the runtime. One of the easiest places
to both get corruption and detect it is in the allocator's free lists
since they appear throughout memory and follow strict invariants. This
commit adds a check when sweeping a span that its free list is sane
and, if not, it prints the corrupted free list and panics. Hopefully
this will help us collect more information on these failures.

Change-Id: I6d417bcaeedf654943a5e068bd76b58bb02d4a64
@methane
Contributor Author

methane commented Jun 6, 2015

@aclements OK, I've merged the two CLs (methane@3566a7d)
and reproduced the problem with them.
https://gist.github.com/methane/56cc12a4030b493df82b#file-1-x2-txt

aclements added a commit that referenced this issue Jun 7, 2015
Stack barriers assume that writes through pointers to frames above the
current frame will get write barriers, and hence these frames do not
need to be re-scanned to pick up these changes. For normal writes,
this is true. However, there are places in the runtime that use
typedmemmove to potentially write through pointers to higher frames
(such as mapassign1). Currently, typedmemmove does not execute write
barriers if the destination is on the stack. If there's a stack
barrier between the current frame and the frame being modified with
typedmemmove, and the stack barrier is not otherwise hit, it's
possible that the garbage collector will never see the updated pointer
and incorrectly reclaim the object.

Fix this by making heapBitsBulkBarrier (which lies behind typedmemmove
and its variants) detect when the destination is in the stack and
unwind stack barriers up to the point, forcing mark termination to
later rescan the affected frame and collect these pointers.

Fixes #11084. Might be related to #10240, #10541, #10941, #11023,
#11027 and possibly others.

Change-Id: I323d6cd0f1d29fa01f8fc946f4b90e04ef210efd
Reviewed-on: https://go-review.googlesource.com/10791
Reviewed-by: Russ Cox <[email protected]>
@dspezia
Contributor

dspezia commented Jun 8, 2015

Following the sequence of CLs 10795, 10791, 10794, 10801 by @aclements, I have tried to reproduce this issue on a physical box. I can actually reproduce it with or without the CLs applied.

Without the CLs:
https://gist.github.com/dspezia/d7e44dd0f60a7d8aefc3#file-test1
https://gist.github.com/dspezia/d7e44dd0f60a7d8aefc3#file-test2

With the CLs:
https://gist.github.com/dspezia/d7e44dd0f60a7d8aefc3#file-test3
https://gist.github.com/dspezia/d7e44dd0f60a7d8aefc3#file-test4
https://gist.github.com/dspezia/d7e44dd0f60a7d8aefc3#file-test5

So they do not seem to help with this specific issue.

@eastany

eastany commented Jun 9, 2015

I hit the same thing.
But when I build my project with go1.2, it works fine.

runtime: g2143: leftover defer sp=0xdeaddeaddeaddead pc=0x48cc36
    defer 0xc20885ea80 sp=0xdeaddeaddeaddead pc=0x48cc36
fatal error: traceback has leftover defers

runtime stack:
runtime.throw(0x99add0, 0x1d)
    /home/work/go/src/runtime/panic.go:527 +0x96
runtime.gentraceback(0x42a799, 0xc208fea6c0, 0x0, 0xc208481400, 0x0, 0x0, 0x7fffffff, 0x7fa93cfb5c58, 0x0, 0x0, ...)
    /home/work/go/src/runtime/traceback.go:450 +0x7b0
runtime.scanstack(0xc208481400)
    /home/work/go/src/runtime/mgcmark.go:319 +0x28b
runtime.gcphasework(0xc208481400)
    /home/work/go/src/runtime/mgcmark.go:270 +0x7f
runtime.stopg(0xc208481400, 0x4)
    /home/work/go/src/runtime/proc1.go:395 +0x91
runtime.markroot(0xc20801a000, 0x3f)
    /home/work/go/src/runtime/mgcmark.go:138 +0x1f2
runtime.parfordo(0xc20801a000)
    /home/work/go/src/runtime/parfor.go:95 +0x12d
runtime.gcscan_m()
    /home/work/go/src/runtime/mgcmark.go:37 +0xde
runtime.gc.func1()
    /home/work/go/src/runtime/mgc.go:830 +0x5f
runtime.systemstack(0xc20801c000)
    /home/work/go/src/runtime/asm_amd64.s:262 +0x79
runtime.mstart()
    /home/work/go/src/runtime/proc1.go:719
go version devel +7eec656 Wed May 20 07:54:37 2015 +0000 linux/amd64

@methane
Contributor Author

methane commented Jun 11, 2015

I see some GC-related fixes have been committed.
I'm running my code with a788c91.
It has run for 30 minutes without a crash. I'll post an update in about 12 hours.

@methane
Contributor Author

methane commented Jun 11, 2015

It crashed. :-/

stkbarPos=0 len(stkbar)=1 goid=967 gcphase=2
fatal error: g already has stack barriers

runtime stack:
runtime.throw(0x751e10, 0x1c)
        /Users/inada-n/local/go/src/runtime/panic.go:527 +0x96
runtime.scanstack(0xc2088c6000)
        /Users/inada-n/local/go/src/runtime/mgcmark.go:334 +0x4f4
runtime.gcphasework(0xc2088c6000)
        /Users/inada-n/local/go/src/runtime/mgcmark.go:273 +0x6d
runtime.stopg(0xc2088c6000, 0x4)
        /Users/inada-n/local/go/src/runtime/proc1.go:395 +0x8e
runtime.markroot(0xc208018000, 0xc2000002e8)
        /Users/inada-n/local/go/src/runtime/mgcmark.go:138 +0x1ef
runtime.parfordo(0xc208018000)
        /Users/inada-n/local/go/src/runtime/parfor.go:95 +0x12a
runtime.gcscan_m()
        /Users/inada-n/local/go/src/runtime/mgcmark.go:37 +0xde
runtime.gc.func1()
        /Users/inada-n/local/go/src/runtime/mgc.go:876 +0x5c
runtime.systemstack(0xc20801a000)
        /Users/inada-n/local/go/src/runtime/asm_amd64.s:262 +0x7c
runtime.mstart()
        /Users/inada-n/local/go/src/runtime/proc1.go:656

goroutine 864 [garbage collection scan]:
runtime.systemstack_switch()
        /Users/inada-n/local/go/src/runtime/asm_amd64.s:216 fp=0xc2080d6ce8 sp=0xc2080d6ce0
runtime.gc(0x0)
        /Users/inada-n/local/go/src/runtime/mgc.go:887 +0x1d9 fp=0xc2080d6f90 sp=0xc2080d6ce8
runtime.backgroundgc()
        /Users/inada-n/local/go/src/runtime/mgc.go:783 +0x3d fp=0xc2080d6fc0 sp=0xc2080d6f90
runtime.goexit()
        /Users/inada-n/local/go/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc2080d6fc8 sp=0xc2080d6fc0
created by runtime.startGC
        /Users/inada-n/local/go/src/runtime/mgc.go:756 +0x147

@aclements
Member

Thanks for running it again, @methane. This actually looks like good progress. Your crash is definitely interesting and I'll need to dig in to it, but it's almost certainly not memory corruption like your other crashes have been. If it's easy, can you continue stress testing and see if you get any other types of crashes?

@methane
Contributor Author

methane commented Jun 14, 2015

I've patched throw() like this to see the backtrace in gdb:

--- a/src/runtime/panic.go
+++ b/src/runtime/panic.go
@@ -519,11 +519,6 @@ func dopanic(unused int) {
 //go:nosplit
 func throw(s string) {
        print("fatal error: ", s, "\n")
-       gp := getg()
-       if gp.m.throwing == 0 {
-               gp.m.throwing = 1
-       }
-       startpanic()
-       dopanic(0)
+       crash()
        *(*int)(nil) = 0 // not reached
 }

And reproduced it (g already has stack barriers).

(gdb) p/x *gp
$6 = {stack = {lo = 0xc208d75000, hi = 0xc208d75fd0}, stackguard0 = 0xc208d75280, stackguard1 = 0xffffffffffffffff, _panic = 0x0, _defer = 0x0, stackAlloc = 0x1000, sched = {
    sp = 0xc208d75c50, pc = 0x42a4d9, g = 0xc2081f2580, ctxt = 0x0, ret = 0x0, lr = 0x0, bp = 0x2}, syscallsp = 0x0, syscallpc = 0x58cf75, stkbar = {array = 0xc208d75fd0,
    len = 0x1, cap = 0x3}, stkbarPos = 0x1, param = 0x0, atomicstatus = 0x1004, goid = 0x106, waitsince = 0x4a3fb6b9787544, waitreason = 0x70e7f0 "select",
  schedlink = 0xc2082dd600, preempt = 0x0, paniconfault = 0x0, preemptscan = 0x0, gcworkdone = 0x0, gcscanvalid = 0x0, throwsplit = 0x0, raceignore = 0x0, m = 0x0, lockedm = 0x0,
  sig = 0x0, writebuf = {array = 0x0, len = 0x0, cap = 0x0}, sigcode0 = 0x0, sigcode1 = 0x0, sigpc = 0x0, gopc = 0x480470, startpc = 0x483420, racectx = 0x0,
  waiting = 0xc20908e640, readyg = 0x0, gcalloc = 0x0, gcscanwork = 0x0}
(gdb) bt
#0  runtime.raise () at /Users/inada-n/local/go/src/runtime/sys_linux_amd64.s:104
#1  0x000000000043a977 in runtime.crash () at /Users/inada-n/local/go/src/runtime/signal1_unix.go:158
#2  0x0000000000427f8d in runtime.throw (s=...) at /Users/inada-n/local/go/src/runtime/panic.go:522
#3  0x00000000004192f4 in runtime.scanstack (gp=0xc2081f2580) at /Users/inada-n/local/go/src/runtime/mgcmark.go:334
#4  0x0000000000418dad in runtime.gcphasework (gp=0xc2081f2580) at /Users/inada-n/local/go/src/runtime/mgcmark.go:273
#5  0x000000000042c41e in runtime.stopg (gp=0xc2081f2580, ~r1=4) at /Users/inada-n/local/go/src/runtime/proc1.go:395
#6  0x000000000041883f in runtime.markroot (desc=0xc208018000, i=198) at /Users/inada-n/local/go/src/runtime/mgcmark.go:138
#7  0x0000000000428aea in runtime.parfordo (desc=0xc208018000) at /Users/inada-n/local/go/src/runtime/parfor.go:95
#8  0x000000000041858e in runtime.gcscan_m () at /Users/inada-n/local/go/src/runtime/mgcmark.go:37
#9  0x000000000044b70c in runtime.gc.func1 () at /Users/inada-n/local/go/src/runtime/mgc.go:876
#10 0x0000000000455c6c in runtime.systemstack () at /Users/inada-n/local/go/src/runtime/asm_amd64.s:262
#11 0x000000000042cd50 in runtime.startTheWorldWithSema () at /Users/inada-n/local/go/src/runtime/proc1.go:591
#12 0x000000000053e751 in net/textproto.(*Reader).readLineSlice (r=0x0, ~r0=..., ~r1=...) at /Users/inada-n/local/go/src/net/textproto/reader.go:55
#13 0x0000000000000000 in ?? ()

It seems the gp is not corrupted.
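
Incidentally, a roughly similar result can usually be obtained without patching the runtime: GOTRACEBACK=crash makes the fatal-error path end in crash() (raising SIGABRT) after printing the usual traceback, so the kernel writes a core file. The patch above additionally skips startpanic/dopanic, which leaves the crashing g a bit less disturbed. A sketch of the unpatched workflow:

$ ulimit -c unlimited
$ GOTRACEBACK=crash ./hello
...
$ gdb ./hello core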

@methane
Contributor Author

methane commented Jun 14, 2015

@aclements It seems it's harder to make it crash than before; it takes several hours.
I've seen the crash only twice, and both crashes were "g already has stack barriers".

I've started stress testing on a temporarily idle server at my company.
I'll use GCP Compute Engine or AWS EC2 next week.

@methane
Contributor Author

methane commented Jun 14, 2015

I reproduced the crash several times. All of them were "g already has stack barriers".

grep stkbar shows:

x.txt:stkbarPos=0 len(stkbar)=1 goid=2105 gcphase=2
x.txt:stkbarPos=1 len(stkbar)=1 goid=169 gcphase=2
x.txt:stkbarPos=1 len(stkbar)=2 goid=318 gcphase=2
x.txt:stkbarPos=0 len(stkbar)=1 goid=1913 gcphase=2
x.txt:stkbarPos=1 len(stkbar)=1 goid=731 gcphase=2
x.txt:stkbarPos=0 len(stkbar)=1 goid=1299 gcphase=2
x.txt:stkbarPos=0 len(stkbar)=2 goid=176 gcphase=2
x.txt:stkbarPos=2 len(stkbar)=2 goid=551 gcphase=2
x.txt:stkbarPos=1 len(stkbar)=1 goid=282 gcphase=2
x.txt:stkbarPos=1 len(stkbar)=2 goid=638 gcphase=2
x.txt:stkbarPos=1 len(stkbar)=1 goid=1426 gcphase=2
stress.0:stkbarPos=1 len(stkbar)=1 goid=459 gcphase=2
stress.1:stkbarPos=0 len(stkbar)=1 goid=1052 gcphase=2
stress.2:stkbarPos=0 len(stkbar)=1 goid=304 gcphase=2
stress.3:stkbarPos=0 len(stkbar)=1 goid=745 gcphase=2
stress.4:stkbarPos=1 len(stkbar)=1 goid=1465 gcphase=2
x.txt:stkbarPos=1 len(stkbar)=2 goid=829 gcphase=2
x.txt:stkbarPos=1 len(stkbar)=1 goid=453 gcphase=2
x.txt:stkbarPos=1 len(stkbar)=1 goid=222 gcphase=2
x.txt:stkbarPos=0 len(stkbar)=1 goid=21 gcphase=2
x.txt:stkbarPos=0 len(stkbar)=1 goid=336 gcphase=2
stress.0:stkbarPos=1 len(stkbar)=1 goid=419 gcphase=2

@methane
Contributor Author

methane commented Jun 14, 2015

It seems most of the bugs causing runtime fatal errors have been fixed, and one bug remains in the concurrent GC?

@dspezia
Contributor

dspezia commented Jun 14, 2015

Same here: on 10 runs, I can now only reproduce the "g already has stack barriers" error.

hello.log:stkbarPos=0 len(stkbar)=1 goid=1025 gcphase=2
hello.log2:stkbarPos=0 len(stkbar)=1 goid=1018 gcphase=2
hello.log3:stkbarPos=0 len(stkbar)=1 goid=1133 gcphase=2
hello.log4:stkbarPos=0 len(stkbar)=1 goid=985 gcphase=2
hello.log5:stkbarPos=0 len(stkbar)=1 goid=950 gcphase=2
hello.log6:stkbarPos=1 len(stkbar)=1 goid=1046 gcphase=2
hello.log7:stkbarPos=2 len(stkbar)=2 goid=272 gcphase=2
hello.log8:stkbarPos=1 len(stkbar)=2 goid=417 gcphase=2
hello.log9:stkbarPos=1 len(stkbar)=1 goid=301 gcphase=2
hello.log10:stkbarPos=0 len(stkbar)=1 goid=1039 gcphase=2

I got 7 failures in the GC, like https://gist.github.com/dspezia/8ebae9a48007de7028c6#file-gc1

I got 3 failures while growing stacks:
https://gist.github.com/dspezia/8ebae9a48007de7028c6#file-stack1
https://gist.github.com/dspezia/8ebae9a48007de7028c6#file-stack2
https://gist.github.com/dspezia/8ebae9a48007de7028c6#file-stack3

go version devel +6b24da6 Sun Jun 14 01:52:54 2015 +0000 linux/amd64

@aclements
Member

Thanks @methane and @dspezia. It's clear that we have a race between the two places that trigger stack barrier insertion (which map exactly to the two classes of tracebacks you saw, @dspezia). I think at this point we've solved the original memory corruption issues. There's still one other related issue involving channels that has an outstanding CL, but it seems to be uninvolved in these tests. I'll dig in to the race when installing stack barriers.

If you're still stress testing, note that it's possible for this race to manifest as other failures. If we fail to catch the race at the "g already has stack barriers" point, we could get an index out of bounds in the runtime. I think it's also possible, though unlikely, for this race to cause the runtime to re-scan too little of the stack, which could cause a missed mark, which could manifest as memory corruption.

aclements added a commit that referenced this issue Jun 16, 2015
Issues #10240, #10541, #10941, #11023, #11027 and possibly others are
indicating memory corruption in the runtime. One of the easiest places
to both get corruption and detect it is in the allocator's free lists
since they appear throughout memory and follow strict invariants. This
commit adds a check when sweeping a span that its free list is sane
and, if not, it prints the corrupted free list and panics. Hopefully
this will help us collect more information on these failures.

Change-Id: I6d417bcaeedf654943a5e068bd76b58bb02d4a64
Reviewed-on: https://go-review.googlesource.com/10713
Reviewed-by: Keith Randall <[email protected]>
Reviewed-by: Russ Cox <[email protected]>
Run-TryBot: Austin Clements <[email protected]>
@methane
Contributor Author

methane commented Jun 17, 2015

CL 11089 may fix this.
I've started stress testing with the CL applied.

@methane
Contributor Author

methane commented Jun 17, 2015

I cannot reproduce the crash anymore.
I'll close this issue when CL 11089 is merged.
Good job, Go developers!

@gopherbot
Contributor

CL https://golang.org/cl/11089 mentions this issue.
