
runtime: nil pointer dereference in sigtrampgo #13363


Closed
aclements opened this issue Nov 22, 2015 · 8 comments

@aclements (Member)

While stress testing TestStackBarrierProfiling at 54bd5a7 on master, I got a segfault in sigtrampgo in signal_linux.go because g != nil, but g.m == nil.

I've saved the binary and core file. Here is some preliminary digging through the core:

Core was generated by `./pprof.test -test.run=TestStackBarrierProfiling'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000000000043de2e in runtime.sigtrampgo (sig=11, info=0xc820061bf0, 
    ctx=0xc820061ac0) at /home/austin/go.dev/src/runtime/signal_linux.go:20
20      setg(g.m.gsignal)
Loading Go Runtime support.
(gdb) bt
#0  0x000000000043de2e in runtime.sigtrampgo (sig=11, info=0xc820061bf0, 
    ctx=0xc820061ac0) at /home/austin/go.dev/src/runtime/signal_linux.go:20
#1  0x000000000045c93b in runtime.sigtramp ()
    at /home/austin/go.dev/src/runtime/sys_linux_amd64.s:234
#2  0x000000000045c940 in runtime.sigtramp ()
    at /home/austin/go.dev/src/runtime/sys_linux_amd64.s:235
#3  0x0000000000000001 in ?? ()
#4  0x0000000000000000 in ?? ()
(gdb) print/x *g
$5 = {stack = {lo = 0xc820092000, hi = 0xc820092800}, 
  stackguard0 = 0xfffffffffffffade, stackguard1 = 0xffffffffffffffff, 
  _panic = 0x0, _defer = 0x0, m = 0x0, stackAlloc = 0x1000, sched = {sp = 0x0, 
    pc = 0x459140, g = 0xc820001200, ctxt = 0x0, ret = 0x0, lr = 0x0, 
    bp = 0x4}, syscallsp = 0x0, syscallpc = 0x0, stkbar =  []runtime.stkbar, 
  stkbarPos = 0x0, stktopsp = 0xc8200927d8, param = 0x0, atomicstatus = 0x2, 
  stackLock = 0x0, goid = 0x6, waitsince = 0x0, 
  waitreason = "GC worker (idle)", schedlink = 0x0, preempt = 0x1, 
  paniconfault = 0x0, preemptscan = 0x0, gcscandone = 0x1, gcscanvalid = 0x0, 
  throwsplit = 0x0, raceignore = 0x0, sysblocktraced = 0x0, 
  sysexitticks = 0x0, sysexitseq = 0x0, lockedm = 0x0, sig = 0x0, 
  writebuf =  []uint8, sigcode0 = 0x0, sigcode1 = 0x0, sigpc = 0x0, 
  gopc = 0x418163, startpc = 0x4181d0, racectx = 0x0, waiting = 0x0, 
  gcAssistBytes = 0x0}
(gdb) x/i g.sched.pc
   0x459140 <runtime.systemstack_switch>:   retq   
(gdb) x/i g.gopc
   0x418163 <runtime.gcBgMarkStartWorkers+147>: 
    lea    0x2a4c96(%rip),%rbx        # 0x6bce00 <runtime.work>
(gdb) x/i g.startpc
   0x4181d0 <runtime.gcBgMarkWorker>:   mov    %fs:0xfffffffffffffff8,%rcx

(gdb) print/x *(struct 'runtime.sigcontext'*)(ctx+40)
$11 = {r8 = 0x0, r9 = 0x0, r10 = 0xc82003ef18, r11 = 0x206, r12 = 0x800, 
  r13 = 0x400, r14 = 0x9, r15 = 0x8, rdi = 0x6bcdb0, rsi = 0x0, 
  rbp = 0xc820026000, rbx = 0x0, rdx = 0x0, rax = 0xc820001200, 
  rcx = 0xc820001200, rsp = 0xc820092738, rip = 0x45920d, eflags = 0x10246, 
  cs = 0x33, gs = 0x0, fs = 0x0, __pad0 = 0x0, err = 0x4, trapno = 0xe, 
  oldmask = 0x0, cr2 = 0x0, fpstate = 0xc820061c80, __reserved1 = {0x0, 0x0, 
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0}}
(gdb) x/i $11.rip
   0x45920d <runtime.morestack+13>: mov    (%rbx),%rsi
(gdb) x/21a $11.rsp
0xc820092738:   0x4189e2 <runtime.gcFlushGCWork+130>    0x417390 <runtime.gcMarkDone+560>
0xc820092748:   0x619060 <runtime.stopTheWorldWithSema.f>   0xffffffff00000001
0xc820092758:   0xffffffff00000001  0x4186a8 <runtime.gcBgMarkWorker+1240>
0xc820092768:   0x0 0x0
0xc820092778:   0xffffffff  0x10
0xc820092788:   0x14    0x0
0xc820092798:   0xffffffff01000000  0xfffffffe
0xc8200927a8:   0x2120869068188 0x0
0xc8200927b8:   0xc820034800    0xc820001200
0xc8200927c8:   0x618e78 <runtime.gcBgMarkWorker.func1.f>   0x45bba1 <runtime.goexit+1>
0xc8200927d8:   0xc820024a00
(gdb) disassemble 'runtime.gcBgMarkWorker'
   0x00000000004186a3 <+1235>:  callq  0x417160 <runtime.gcMarkDone>
   0x00000000004186a8 <+1240>:  mov    %fs:0xfffffffffffffff8,%rax
(gdb) disassemble 'runtime.gcMarkDone'
   0x000000000041738b <+555>:   callq  0x418960 <runtime.gcFlushGCWork>
   0x0000000000417390 <+560>:   callq  0x41a650 <runtime.gcWakeAllAssists>
(gdb) disassemble 'runtime.gcFlushGCWork'  
   0x00000000004189dd <+125>:   callq  0x459290 <runtime.morestack_noctxt>
   0x00000000004189e2 <+130>:   jmpq   0x418960 <runtime.gcFlushGCWork>

This appears to be a nested signal. The original signal was a SIGSEGV in morestack at MOVQ m_g0(BX), SI because BX (getg().m) is 0. The signal handler then also crashed for the same reason. We clearly tried to grow the stack in gcFlushGCWork (the stack is very small because this test runs in gcstackbarrierall mode), but I'm not sure why there wasn't an M at that point. It may be related to the fact that we've stopped the world in gcMarkDone at that point.
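
To make the failure mode concrete, here is a self-contained sketch (my own simplified types, not the runtime's) of the state the backtrace shows: the signal handler's prologue assumes every g has an m, so with g != nil but g.m == nil the equivalent of setg(g.m.gsignal) faults, just as morestack's MOVQ m_g0(BX), SI faulted first. The nil check is a hypothetical guard to show what's missing, not the actual code or a proposed fix.

package main

import "fmt"

type m struct{ gsignal *g }
type g struct{ m *m }

// handleSignal mimics the shape of sigtrampgo's prologue.
func handleSignal(gp *g) {
	if gp == nil {
		fmt.Println("no g: deliver via the bad-signal path")
		return
	}
	if gp.m == nil {
		// This is the state in the core dump: g is a gcBgMarkWorker
		// goroutine with m == 0, so g.m.gsignal dereferences nil.
		fmt.Println("g without m: setg(g.m.gsignal) would fault here")
		return
	}
	_ = gp.m.gsignal // the real code does setg(g.m.gsignal) here
}

func main() {
	handleSignal(&g{m: nil}) // g != nil, g.m == nil, as in the backtrace
}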

This is relatively easy to reproduce. It happened five times out of 3,000 stress runs on my workstation (which took ~25 minutes).

aclements added this to the Go1.6 milestone Nov 22, 2015
@aclements (Member Author)

I've confirmed that profiling signals are necessary to make this happen, so it's not simply an unrelated bug that happens to be tickled by this test.

@gopherbot (Contributor)

CL https://golang.org/cl/17149 mentions this issue.

aclements added a commit that referenced this issue Nov 23, 2015
This improves the documentation comment on gcMarkDone, replaces a
recursive call with a simple goto, and disables preemption before
stopping the world in accordance with the documentation comment on
stopTheWorldWithSema.

Updates #13363, but, sadly, doesn't fix it.

Change-Id: I6cb2a5836b35685bf82f7b1ce7e48a7625906656
Reviewed-on: https://go-review.googlesource.com/17149
Reviewed-by: Rick Hudson <[email protected]>
Run-TryBot: Austin Clements <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
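
For readers unfamiliar with the pattern, the "recursive call with a simple goto" change described in that commit message has roughly this shape (a generic sketch with made-up names, not the runtime's actual gcMarkDone; the CL's preemption-disabling around the stop-the-world is not shown):

package main

import "fmt"

// flushRecursive retries by calling itself, adding a stack frame per retry.
func flushRecursive(pending int) {
	if pending == 0 {
		fmt.Println("done")
		return
	}
	fmt.Println("flushing, pending:", pending)
	flushRecursive(pending - 1)
}

// flushGoto expresses the same control flow with a backward goto,
// so no new frames are created.
func flushGoto(pending int) {
top:
	if pending == 0 {
		fmt.Println("done")
		return
	}
	fmt.Println("flushing, pending:", pending)
	pending--
	goto top
}

func main() {
	flushRecursive(3)
	flushGoto(3)
}
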
@aclements (Member Author)

No real progress, but I've tried a bunch of things.

First, it's still reproducible at current master (87d939d) with 5 failures out of 2933 runs.

It can be reproduced just as well without the extra fork/exec, which is useful for debugging:

go test -c && stress sh -c "GODEBUG=gcstackbarrierall=1 GOGC=1 ./pprof.test -test.run TestStackBarrierProfiling"

I tried getting a branch trace record, but both techniques caused it to no longer reproduce even after >10,000 runs.

Using perf:

go test -c && stress sh -c 'GODEBUG=gcstackbarrierall=1 GOGC=1 perf record -e branches:u -c 1 -d -m 32M -o perf.data.$$ ./pprof.test -test.run TestStackBarrierProfiling && rm perf.data.$$'

Using gdb:

cat > gdbtest <<EOF
catch signal SIGSEGV
commands
set record instruction-history-size unlimited
#record function-call-history /ilc
set logging file history.txt
set logging on
record instruction-history
set logging off
!mv history.txt history-$$.txt
quit 1
end

start
record btrace
continue
quit
EOF
go test -c && stress sh -c 'GODEBUG=gcstackbarrierall=1 GOGC=1 gdb -x gdbtest --args ./pprof.test -test.run TestStackBarrierProfiling'

It does appear to be sensitive to the profiling rate. I increased it from 100 Hz to 500 Hz and it got ~3X more reproducible.
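
The issue doesn't say how the rate was raised; one hypothetical way to drive SIGPROF at ~500 Hz from ordinary Go code is runtime.SetCPUProfileRate, e.g.:

package main

import (
	"runtime"
	"time"
)

func main() {
	// Hypothetical sketch: turn on CPU profiling at 500 Hz so SIGPROF is
	// delivered ~5X as often as the default 100 Hz. How the tweak was
	// actually made for this test isn't stated in the issue.
	runtime.SetCPUProfileRate(500)
	defer runtime.SetCPUProfileRate(0) // turn profiling back off

	// Busy-loop so the profiling timer has something to interrupt.
	deadline := time.Now().Add(2 * time.Second)
	for time.Now().Before(deadline) {
	}
}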

@aclements (Member Author)

gcstackbarrierall=1 is not necessary to reproduce this failure, which means it's not just something funny that debug mode is doing:

go test -c && stress sh -c "GODEBUG=gcstackbarrierall=1,gcstackbarrierall=0 GOGC=1 ./pprof.test -test.run TestStackBarrierProfiling"

(The weird GODEBUG is to trick the test into not re-exec'ing itself.) Turning off this debug mode does appear to decrease the failure probability by 2–3X, though.
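
A hypothetical illustration of why the repeated key works, assuming GODEBUG is parsed left to right with later settings overwriting earlier ones while the test only does a substring check for gcstackbarrierall=1:

package main

import (
	"fmt"
	"strings"
)

// parseGodebug is a stand-in for the runtime's GODEBUG parsing, written
// here only to illustrate the assumed last-one-wins behavior.
func parseGodebug(s string) map[string]string {
	settings := map[string]string{}
	for _, field := range strings.Split(s, ",") {
		if name, value, ok := strings.Cut(field, "="); ok {
			settings[name] = value // a later duplicate overwrites
		}
	}
	return settings
}

func main() {
	godebug := "gcstackbarrierall=1,gcstackbarrierall=0"
	fmt.Println(parseGodebug(godebug)["gcstackbarrierall"])       // 0: debug mode effectively off
	fmt.Println(strings.Contains(godebug, "gcstackbarrierall=1")) // true: the test skips re-exec'ing
}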

@gopherbot (Contributor)

CL https://golang.org/cl/18761 mentions this issue.

@aclements (Member Author)

I can still reproduce this on master (df2a9e4) with 4 failures out of 2617 runs (with the 500 Hz tweak). Now running a stress test with CL 18761 applied.

@aclements (Member Author)

Over 10,000 runs with CL 18761 applied, I saw no failures obviously related to this bug. I did get 7 timeouts; I'm now trying to make sure those weren't introduced by the CL.

@aclements (Member Author)

The timeouts were almost certainly caused by the CL. The scheduler is getting stuck trying to cas a _Gdead mark worker from _Gwaiting to _Grunnable.
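
A self-contained illustration (not the runtime's code) of how that kind of hang looks: a CAS loop that expects _Gwaiting can never succeed once the worker has already been put in _Gdead.

package main

import (
	"fmt"
	"sync/atomic"
)

// Illustrative goroutine status values for this sketch only.
const (
	statusRunnable uint32 = 1
	statusWaiting  uint32 = 4
	statusDead     uint32 = 6
)

// tryReady spins attempting to move a goroutine from waiting to runnable,
// roughly the shape of the CAS the scheduler is stuck in. With a bounded
// spin it gives up; the real loop would spin forever.
func tryReady(status *uint32, maxSpins int) bool {
	for i := 0; i < maxSpins; i++ {
		if atomic.CompareAndSwapUint32(status, statusWaiting, statusRunnable) {
			return true
		}
	}
	return false
}

func main() {
	status := statusDead                             // a mark worker already marked dead
	fmt.Println("readied:", tryReady(&status, 1000)) // false: the CAS never succeeds
}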

golang locked and limited conversation to collaborators Feb 3, 2017