runtime: random fatal errors in runtime #10941
Version: tip 8017ace
Program: https://github.com/methane/hellobench.go/blob/master/hello.go
Stack dump: https://gist.github.com/methane/9701f5c7f58e2d701b65
To reproduce:
I've encountered this problem twice, but I forgot to get a core file. Is this a known bug?
Comments
I can reproduce it with simpler code: https://gist.github.com/methane/7f0e381b6bde87dd8469#file-hello-go

$ ~/local/go-dev/bin/go version
go version devel +8017ace Sat May 23 17:42:43 2015 +0000 linux/amd64
$ ~/local/go-dev/bin/go build hello.go
$ GOMAXPROCS=16 ./hello
$ wrk -t4 -c800 http://127.0.0.1:8080/ -d10000

OS is Linux amd64.
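For context, a minimal sketch of the kind of hello-world server being load-tested here (the real hello.go lives in the gist above; treat this reconstruction as an assumption, not the actual code):

package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Trivial handler: the crash is triggered by load, not by handler logic.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "Hello, World")
	})
	log.Fatal(http.ListenAndServe("127.0.0.1:8080", nil))
}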
But this time I got a different crash message:
https://gist.github.com/methane/7f0e381b6bde87dd8469#file-crashdump
Thanks for your report. I have not been able to reproduce either of these crashes yet.
% go version
Are you able to reproduce the crash outside the EC2 environment?
I think more concurrency makes it easier to reproduce.
I can't reproduce any crashes either, even on a machine with 16 cores.
$ go version
go version devel +8017ace Sat May 23 17:42:43 2015 +0000 linux/amd64
I cannot reproduce it on small VMs. Will try later on larger ones.
$ go version
I reproduced it on an EC2 instance with 36 vCPUs (and GOMAXPROCS=36), after 10 minutes of traffic injection at more than 400K r/s (about 200M requests).
$ go version
When I'm lucky(?), I can reproduce it in 1m30s.
Are folks using the stress tool (https://go-review.googlesource.com/#/c/9373/)? I have found it makes me lucky...
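For anyone unfamiliar with it: stress runs a binary over and over in parallel and collects the failing runs. A typical session, assuming the reproducer builds to ./repro, looks something like this:

$ go get golang.org/x/tools/cmd/stress
$ go build -o repro repro.go
$ stress ./repro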
@RLH Does it use processes to run the tests in parallel? Here is new code to reproduce it; this doesn't require wrk. I reproduced it once on a c4.4xlarge (16 cores) within an hour.
fatal error: acquireSudog: found s.elem != nil in cache
goroutine 55 [running]:
runtime.throw(0x75d970, 0x2a)
/Users/inada-n/local/go/src/runtime/panic.go:527 +0x96 fp=0xc208d8e5a8 sp=0xc208d8e590
runtime.acquireSudog(0xc208e03700)
/Users/inada-n/local/go/src/runtime/proc.go:232 +0x332 fp=0xc208d8e640 sp=0xc208d8e5a8
runtime.selectgoImpl(0xc208d8e938, 0x0, 0x18)
/Users/inada-n/local/go/src/runtime/select.go:369 +0x889 fp=0xc208d8e7e8 sp=0xc208d8e640
runtime.selectgo(0xc208d8e938)
/Users/inada-n/local/go/src/runtime/select.go:212 +0x12 fp=0xc208d8e808 sp=0xc208d8e7e8
net/http.(*persistConn).roundTrip(0xc2080fe000, 0xc2085da420, 0x0, 0x0, 0x0)
/Users/inada-n/local/go/src/net/http/transport.go:1165 +0x8e1 fp=0xc208d8ea40 sp=0xc208d8e808
net/http.(*Transport).RoundTrip(0xc20806a120, 0xc208100f70, 0xc207ff2546, 0x0, 0x0)
/Users/inada-n/local/go/src/net/http/transport.go:235 +0x533 fp=0xc208d8eb60 sp=0xc208d8ea40
net/http.send(0xc208100f70, 0x7fdfb4a2b468, 0xc20806a120, 0x16, 0x0, 0x0)
/Users/inada-n/local/go/src/net/http/client.go:220 +0x4e1 fp=0xc208d8ec58 sp=0xc208d8eb60
net/http.(*Client).send(0x87cd20, 0xc208100f70, 0x16, 0x0, 0x0)
/Users/inada-n/local/go/src/net/http/client.go:143 +0x15d fp=0xc208d8ed28 sp=0xc208d8ec58
net/http.(*Client).doFollowingRedirects(0x87cd20, 0xc208100f70, 0x78d1d8, 0x0, 0x0, 0x0)
/Users/inada-n/local/go/src/net/http/client.go:380 +0xbc3 fp=0xc208d8ef28 sp=0xc208d8ed28
net/http.(*Client).Get(0x87cd20, 0xc208010540, 0x16, 0xd, 0x0, 0x0)
/Users/inada-n/local/go/src/net/http/client.go:306 +0xad fp=0xc208d8ef78 sp=0xc208d8ef28
main.client()
/Users/inada-n/work/hellobench.go/hello/repro.go:43 +0x44 fp=0xc208d8efe0 sp=0xc208d8ef78
runtime.goexit()
/Users/inada-n/local/go/src/runtime/asm_amd64.s:1670 +0x1 fp=0xc208d8efe8 sp=0xc208d8efe0
created by main.main
/Users/inada-n/work/hellobench.go/hello/repro.go:34 +0x122
goroutine 1 [sleep]:
time.Sleep(0x3b9aca00)
/Users/inada-n/local/go/src/runtime/time.go:59 +0xfc
main.main()
/Users/inada-n/work/hellobench.go/hello/repro.go:37 +0x142
goroutine 5 [IO wait, 46 minutes]:
net.runtime_pollWait(0x7fdfb4a30590, 0x72, 0xc20800a1a0)
/Users/inada-n/local/go/src/runtime/netpoll.go:157 +0x63
net.(*pollDesc).Wait(0xc208042140, 0x72, 0x0, 0x0)
...

It was cross-compiled on a Mac.
$ ~/local/go/bin/go version
go version devel +8017ace Sat May 23 17:42:43 2015 +0000 darwin/amd64
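For reference, a rough sketch of the shape of that reproducer, reconstructed from the traceback above: many client goroutines hammering an in-process server with http.Get. The real repro.go in the gist differs in details such as the goroutine count, which is a guess here.

package main

import (
	"fmt"
	"io"
	"io/ioutil"
	"net/http"
	"time"
)

// client issues requests in a tight loop, matching the http.Get frame
// in the traceback (repro.go:43).
func client() {
	for {
		resp, err := http.Get("http://127.0.0.1:8080/")
		if err != nil {
			continue
		}
		io.Copy(ioutil.Discard, resp.Body) // drain so the connection is reused
		resp.Body.Close()
	}
}

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "Hello, World")
	})
	go http.ListenAndServe("127.0.0.1:8080", nil)

	for i := 0; i < 100; i++ { // more concurrency makes the crash more likely
		go client()
	}
	for {
		time.Sleep(time.Second) // keep main alive, as in the traceback (repro.go:37)
	}
}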
I have reproduced it with wrk on a physical box (32 hardware threads, GOMAXPROCS=32) running SLES11 SP2, in 15 minutes. So the problem is not limited to AWS EC2 instances.
$ go version
I reproduced it on an 8-core machine.
fatal error: unexpected signal during runtime execution
[signal 0xb code=0x1 addr=0x53 pc=0x417c0e]
goroutine 519 [running]:
runtime.throw(0x765d70, 0x2a)
/Users/inada-n/local/go/src/runtime/panic.go:527 +0x96 fp=0xc20801f530 sp=0xc20801f518
runtime.sigpanic()
/Users/inada-n/local/go/src/runtime/sigpanic_unix.go:12 +0x5d fp=0xc20801f580 sp=0xc20801f530
runtime.clearpools()
/Users/inada-n/local/go/src/runtime/mgc.go:1439 +0xce fp=0xc20801f590 sp=0xc20801f580
runtime.gc(0x0)
/Users/inada-n/local/go/src/runtime/mgc.go:812 +0x165 fp=0xc20801f7b0 sp=0xc20801f590
runtime.backgroundgc()
/Users/inada-n/local/go/src/runtime/mgc.go:761 +0x40 fp=0xc20801f7e0 sp=0xc20801f7b0
runtime.goexit()
/Users/inada-n/local/go/src/runtime/asm_amd64.s:1670 +0x1 fp=0xc20801f7e8 sp=0xc20801f7e0
created by runtime.startGC
/Users/inada-n/local/go/src/runtime/mgc.go:734 +0x14a
Shall I try to take a core dump?
CL https://golang.org/cl/10713 mentions this issue.
I merged CL https://golang.org/cl/10713 and reproduced it:
CL https://golang.org/cl/10791 mentions this issue.
@methane, thanks for trying out CL 10713. Unfortunately, your repro didn't hit the path that CL 10713 adds debugging to. It looks like your original traceback would have, though. If you don't mind, could you try reproducing it a few more times to see if you can get the "free list corrupted" panic from CL 10713? Even better, if you can reproduce it reliably enough, it would be great if you could try CL 10791.
Issues golang#10240, golang#10541, golang#10941, golang#11023, golang#11027 and possibly others are indicating memory corruption in the runtime. One of the easiest places to both get corruption and detect it is in the allocator's free lists, since they appear throughout memory and follow strict invariants. This commit adds a check when sweeping a span that its free list is sane and, if not, it prints the corrupted free list and panics. Hopefully this will help us collect more information on these failures.

Change-Id: I6d417bcaeedf654943a5e068bd76b58bb02d4a64
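The idea is simple to illustrate. A hedged sketch of such an invariant check, in ordinary Go rather than the actual runtime code (the real check lives in the sweeper and also knows the span's element size and layout):

package freecheck

import "unsafe"

type gclinkptr uintptr // a free-list link; the next pointer is stored in the free object itself

type span struct {
	start, limit uintptr   // address range covered by the span
	freelist     gclinkptr // head of the singly linked free list
}

// checkFreelist walks the free list and panics if a node lies outside the
// span's bounds, turning silent corruption into a loud, early failure.
func checkFreelist(s *span) {
	for p := s.freelist; p != 0; p = *(*gclinkptr)(unsafe.Pointer(p)) {
		if uintptr(p) < s.start || uintptr(p) >= s.limit {
			panic("free list corrupted")
		}
	}
}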
@aclements OK, I've merged the two CLs: methane@3566a7d
Stack barriers assume that writes through pointers to frames above the current frame will get write barriers, and hence these frames do not need to be re-scanned to pick up these changes. For normal writes, this is true. However, there are places in the runtime that use typedmemmove to potentially write through pointers to higher frames (such as mapassign1). Currently, typedmemmove does not execute write barriers if the destination is on the stack. If there's a stack barrier between the current frame and the frame being modified with typedmemmove, and the stack barrier is not otherwise hit, it's possible that the garbage collector will never see the updated pointer and incorrectly reclaim the object.

Fix this by making heapBitsBulkBarrier (which lies behind typedmemmove and its variants) detect when the destination is in the stack and unwind stack barriers up to that point, forcing mark termination to later rescan the affected frame and collect these pointers.

Fixes #11084. Might be related to #10240, #10541, #10941, #11023, #11027 and possibly others.

Change-Id: I323d6cd0f1d29fa01f8fc946f4b90e04ef210efd
Reviewed-on: https://go-review.googlesource.com/10791
Reviewed-by: Russ Cox <[email protected]>
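The problematic pattern is easy to picture in user-level terms. A hedged illustration (whether a given assignment actually goes through typedmemmove depends on compiler decisions, so take this as a sketch of the shape of the bug, not a standalone reproducer):

package main

type big struct {
	x *int   // the pointer the GC must not miss
	y [8]int // extra words so the assignment is a bulk typed copy
}

// callee writes a multi-word value through a pointer into its caller's
// frame. That copy can be performed by typedmemmove, which skipped write
// barriers for stack destinations, so a stack barrier sitting between the
// two frames could hide the freshly stored pointer from the GC.
func callee(p *big) {
	n := 42
	*p = big{x: &n}
}

func caller() *int {
	var v big // lives in caller's frame, one level above callee
	callee(&v)
	return v.x // the GC must observe the pointer callee stored here
}

func main() {
	println(*caller())
}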
Following the sequence of CLs 10795, 10791, 10794 and 10801 by @aclements, I have tried to reproduce this issue on a physical box. Actually, I can reproduce it with or without the CLs applied.
Without the CLs:
With the CLs:
So they do not seem to help on this specific issue.
I ran into the same thing.
I see that some fixes related to the GC have been committed.
It crashed. :-/
Thanks for running it again, @methane. This actually looks like good progress. Your crash is definitely interesting and I'll need to dig in to it, but it's almost certainly not memory corruption like your other crashes have been. If it's easy, can you continue stress testing and see if you get any other types of crashes?
I've patched throw() like this to see the backtrace in gdb:

--- a/src/runtime/panic.go
+++ b/src/runtime/panic.go
@@ -519,11 +519,6 @@ func dopanic(unused int) {
 //go:nosplit
 func throw(s string) {
 	print("fatal error: ", s, "\n")
-	gp := getg()
-	if gp.m.throwing == 0 {
-		gp.m.throwing = 1
-	}
-	startpanic()
-	dopanic(0)
+	crash()
 	*(*int)(nil) = 0 // not reached
 }

And reproduced it.
It seems the …
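A note for anyone reproducing this: a similar effect can usually be had without patching the runtime. With core dumps enabled, GOTRACEBACK=crash makes the runtime call crash() after printing the traceback, e.g.:

$ ulimit -c unlimited
$ GOTRACEBACK=crash ./repro    # raises SIGABRT on fatal error, leaving a core file
$ gdb ./repro core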
@aclements It seems it's harder to crash than before; it takes some hours now. I've started stress testing on a temporarily idle server at my company.
I reproduced the crash several times. All of them are "g already has stack barriers". grep stkbar shows:
It seems most of the bugs causing runtime fatal errors have been fixed, and one bug remains in the concurrent GC?
Same here: on 10 runs, I can now only reproduce the "g already has stack barriers" error.
I got 7 failures in the GC, like https://gist.github.com/dspezia/8ebae9a48007de7028c6#file-gc1
I got 3 failures while growing stacks:
go version devel +6b24da6 Sun Jun 14 01:52:54 2015 +0000 linux/amd64
Thanks @methane and @dspezia. It's clear that we have a race between the two places that trigger stack barrier insertion (which map exactly to the two classes of tracebacks you saw, @dspezia). I think at this point we've solved the original memory corruption issues. There's still one other related issue involving channels that has an outstanding CL, but it seems to be uninvolved in these tests. I'll dig into the race when installing stack barriers.

If you're still stress testing, note that it's possible for this race to manifest as other failures. If we fail to catch the race at the "g already has stack barriers" point, we could get an index out of bounds in the runtime. I think it's also possible, though unlikely, for this race to cause the runtime to re-scan too little of the stack, which could cause a missed mark, which could manifest as memory corruption.
Issues #10240, #10541, #10941, #11023, #11027 and possibly others are indicating memory corruption in the runtime. One of the easiest places to both get corruption and detect it is in the allocator's free lists since they appear throughout memory and follow strict invariants. This commit adds a check when sweeping a span that its free list is sane and, if not, it prints the corrupted free list and panics. Hopefully this will help us collect more information on these failures.

Change-Id: I6d417bcaeedf654943a5e068bd76b58bb02d4a64
Reviewed-on: https://go-review.googlesource.com/10713
Reviewed-by: Keith Randall <[email protected]>
Reviewed-by: Russ Cox <[email protected]>
Run-TryBot: Austin Clements <[email protected]>
CL 11089 may fix this.
I cannot reproduce the crash anymore. |
CL https://golang.org/cl/11089 mentions this issue.