
runtime: nil pointer dereference in sigtrampgo #13363


Closed
aclements opened this issue Nov 22, 2015 · 8 comments

@aclements (Member)

While stress testing TestStackBarrierProfiling at 54bd5a7 on master, I got a segfault in sigtrampgo in signal_linux.go because g != nil, but g.m == nil.

I've saved the binary and core file. Here is some preliminary digging through the core:

Core was generated by `./pprof.test -test.run=TestStackBarrierProfiling'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000000000043de2e in runtime.sigtrampgo (sig=11, info=0xc820061bf0, 
    ctx=0xc820061ac0) at /home/austin/go.dev/src/runtime/signal_linux.go:20
20      setg(g.m.gsignal)
Loading Go Runtime support.
(gdb) bt
#0  0x000000000043de2e in runtime.sigtrampgo (sig=11, info=0xc820061bf0, 
    ctx=0xc820061ac0) at /home/austin/go.dev/src/runtime/signal_linux.go:20
#1  0x000000000045c93b in runtime.sigtramp ()
    at /home/austin/go.dev/src/runtime/sys_linux_amd64.s:234
#2  0x000000000045c940 in runtime.sigtramp ()
    at /home/austin/go.dev/src/runtime/sys_linux_amd64.s:235
#3  0x0000000000000001 in ?? ()
#4  0x0000000000000000 in ?? ()
(gdb) print/x *g
$5 = {stack = {lo = 0xc820092000, hi = 0xc820092800}, 
  stackguard0 = 0xfffffffffffffade, stackguard1 = 0xffffffffffffffff, 
  _panic = 0x0, _defer = 0x0, m = 0x0, stackAlloc = 0x1000, sched = {sp = 0x0, 
    pc = 0x459140, g = 0xc820001200, ctxt = 0x0, ret = 0x0, lr = 0x0, 
    bp = 0x4}, syscallsp = 0x0, syscallpc = 0x0, stkbar =  []runtime.stkbar, 
  stkbarPos = 0x0, stktopsp = 0xc8200927d8, param = 0x0, atomicstatus = 0x2, 
  stackLock = 0x0, goid = 0x6, waitsince = 0x0, 
  waitreason = "GC worker (idle)", schedlink = 0x0, preempt = 0x1, 
  paniconfault = 0x0, preemptscan = 0x0, gcscandone = 0x1, gcscanvalid = 0x0, 
  throwsplit = 0x0, raceignore = 0x0, sysblocktraced = 0x0, 
  sysexitticks = 0x0, sysexitseq = 0x0, lockedm = 0x0, sig = 0x0, 
  writebuf =  []uint8, sigcode0 = 0x0, sigcode1 = 0x0, sigpc = 0x0, 
  gopc = 0x418163, startpc = 0x4181d0, racectx = 0x0, waiting = 0x0, 
  gcAssistBytes = 0x0}
(gdb) x/i g.sched.pc
   0x459140 <runtime.systemstack_switch>:   retq   
(gdb) x/i g.gopc
   0x418163 <runtime.gcBgMarkStartWorkers+147>: 
    lea    0x2a4c96(%rip),%rbx        # 0x6bce00 <runtime.work>
(gdb) x/i g.startpc
   0x4181d0 <runtime.gcBgMarkWorker>:   mov    %fs:0xfffffffffffffff8,%rcx

(gdb) print/x *(struct 'runtime.sigcontext'*)(ctx+40)
$11 = {r8 = 0x0, r9 = 0x0, r10 = 0xc82003ef18, r11 = 0x206, r12 = 0x800, 
  r13 = 0x400, r14 = 0x9, r15 = 0x8, rdi = 0x6bcdb0, rsi = 0x0, 
  rbp = 0xc820026000, rbx = 0x0, rdx = 0x0, rax = 0xc820001200, 
  rcx = 0xc820001200, rsp = 0xc820092738, rip = 0x45920d, eflags = 0x10246, 
  cs = 0x33, gs = 0x0, fs = 0x0, __pad0 = 0x0, err = 0x4, trapno = 0xe, 
  oldmask = 0x0, cr2 = 0x0, fpstate = 0xc820061c80, __reserved1 = {0x0, 0x0, 
    0x0, 0x0, 0x0, 0x0, 0x0, 0x0}}
(gdb) x/i $11.rip
   0x45920d <runtime.morestack+13>: mov    (%rbx),%rsi
(gdb) x/21a $11.rsp
0xc820092738:   0x4189e2 <runtime.gcFlushGCWork+130>    0x417390 <runtime.gcMarkDone+560>
0xc820092748:   0x619060 <runtime.stopTheWorldWithSema.f>   0xffffffff00000001
0xc820092758:   0xffffffff00000001  0x4186a8 <runtime.gcBgMarkWorker+1240>
0xc820092768:   0x0 0x0
0xc820092778:   0xffffffff  0x10
0xc820092788:   0x14    0x0
0xc820092798:   0xffffffff01000000  0xfffffffe
0xc8200927a8:   0x2120869068188 0x0
0xc8200927b8:   0xc820034800    0xc820001200
0xc8200927c8:   0x618e78 <runtime.gcBgMarkWorker.func1.f>   0x45bba1 <runtime.goexit+1>
0xc8200927d8:   0xc820024a00
(gdb) disassemble 'runtime.gcBgMarkWorker'
   0x00000000004186a3 <+1235>:  callq  0x417160 <runtime.gcMarkDone>
   0x00000000004186a8 <+1240>:  mov    %fs:0xfffffffffffffff8,%rax
(gdb) disassemble 'runtime.gcMarkDone'
   0x000000000041738b <+555>:   callq  0x418960 <runtime.gcFlushGCWork>
   0x0000000000417390 <+560>:   callq  0x41a650 <runtime.gcWakeAllAssists>
(gdb) disassemble 'runtime.gcFlushGCWork'  
   0x00000000004189dd <+125>:   callq  0x459290 <runtime.morestack_noctxt>
   0x00000000004189e2 <+130>:   jmpq   0x418960 <runtime.gcFlushGCWork>

This appears to be a nested signal. The original signal was a SIGSEGV in morestack at MOVQ m_g0(BX), SI because BX (getg().m) is 0. The signal handler then also crashed for the same reason. We clearly tried to grow the stack in gcFlushGCWork (the stack is very small because this test runs in gcstackbarrierall mode), but I'm not sure why there wasn't an M at that point. It may be related to the fact that we've stopped the world in gcMarkDone at that point.
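
To make the failure mode concrete, here is a self-contained sketch (my own simplified types, not the runtime's) of the state the backtrace shows: the signal handler's prologue assumes every g has an m, so with g != nil but g.m == nil the equivalent of setg(g.m.gsignal) faults, just as morestack's MOVQ m_g0(BX), SI faulted first. The nil check is a hypothetical guard to show what's missing, not the actual code or a proposed fix.

package main

import "fmt"

type m struct{ gsignal *g }
type g struct{ m *m }

// handleSignal mimics the shape of sigtrampgo's prologue.
func handleSignal(gp *g) {
	if gp == nil {
		fmt.Println("no g: deliver via the bad-signal path")
		return
	}
	if gp.m == nil {
		// This is the state in the core dump: g is a gcBgMarkWorker
		// goroutine with m == 0, so g.m.gsignal dereferences nil.
		fmt.Println("g without m: setg(g.m.gsignal) would fault here")
		return
	}
	_ = gp.m.gsignal // the real code does setg(g.m.gsignal) here
}

func main() {
	handleSignal(&g{m: nil}) // g != nil, g.m == nil, as in the backtrace
}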

This is relatively easy to reproduce. It happened five times out of 3,000 stress runs on my workstation (which took ~25 minutes).

aclements added this to the Go1.6 milestone Nov 22, 2015
@aclements (Member Author)

I've confirmed that profiling signals are necessary to make this happen, so it's not simply an unrelated bug that happens to be tickled by this test.

@gopherbot (Contributor)

CL https://golang.org/cl/17149 mentions this issue.

aclements added a commit that referenced this issue Nov 23, 2015
This improves the documentation comment on gcMarkDone, replaces a
recursive call with a simple goto, and disables preemption before
stopping the world in accordance with the documentation comment on
stopTheWorldWithSema.

Updates #13363, but, sadly, doesn't fix it.

Change-Id: I6cb2a5836b35685bf82f7b1ce7e48a7625906656
Reviewed-on: https://go-review.googlesource.com/17149
Reviewed-by: Rick Hudson <[email protected]>
Run-TryBot: Austin Clements <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
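
For readers unfamiliar with the pattern, the "recursive call with a simple goto" change described in that commit message has roughly this shape (a generic sketch with made-up names, not the runtime's actual gcMarkDone; the CL's preemption-disabling around the stop-the-world is not shown):

package main

import "fmt"

// flushRecursive retries by calling itself, adding a stack frame per retry.
func flushRecursive(pending int) {
	if pending == 0 {
		fmt.Println("done")
		return
	}
	fmt.Println("flushing, pending:", pending)
	flushRecursive(pending - 1)
}

// flushGoto expresses the same control flow with a backward goto,
// so no new frames are created.
func flushGoto(pending int) {
top:
	if pending == 0 {
		fmt.Println("done")
		return
	}
	fmt.Println("flushing, pending:", pending)
	pending--
	goto top
}

func main() {
	flushRecursive(3)
	flushGoto(3)
}
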
@aclements (Member Author)

No real progress, but I've tried a bunch of things.

First, it's still reproducible at current master (87d939d) with 5 failures out of 2933 runs.

It can be reproduced just as well without the extra fork/exec, which is useful for debugging:

go test -c && stress sh -c "GODEBUG=gcstackbarrierall=1 GOGC=1 ./pprof.test -test.run TestStackBarrierProfiling"

I tried getting a branch trace record, but both techniques caused it to no longer reproduce even after >10,000 runs.

Using perf:

go test -c && stress sh -c 'GODEBUG=gcstackbarrierall=1 GOGC=1 perf record -e branches:u -c 1 -d -m 32M -o perf.data.$$ ./pprof.test -test.run TestStackBarrierProfiling && rm perf.data.$$'

Using gdb:

cat > gdbtest <<EOF
catch signal SIGSEGV
commands
set record instruction-history-size unlimited
#record function-call-history /ilc
set logging file history.txt
set logging on
record instruction-history
set logging off
!mv history.txt history-$$.txt
quit 1
end

start
record btrace
continue
quit
EOF
go test -c && stress sh -c 'GODEBUG=gcstackbarrierall=1 GOGC=1 gdb -x gdbtest --args ./pprof.test -test.run TestStackBarrierProfiling'

It does appear to be sensitive to the profiling rate. I increased it from 100 Hz to 500 Hz and it got ~3X more reproducible.
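
The issue doesn't say how the rate was raised; one hypothetical way to drive SIGPROF at ~500 Hz from ordinary Go code is runtime.SetCPUProfileRate, e.g.:

package main

import (
	"runtime"
	"time"
)

func main() {
	// Hypothetical sketch: turn on CPU profiling at 500 Hz so SIGPROF is
	// delivered ~5X as often as the default 100 Hz. How the tweak was
	// actually made for this test isn't stated in the issue.
	runtime.SetCPUProfileRate(500)
	defer runtime.SetCPUProfileRate(0) // turn profiling back off

	// Busy-loop so the profiling timer has something to interrupt.
	deadline := time.Now().Add(2 * time.Second)
	for time.Now().Before(deadline) {
	}
}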

@aclements (Member Author)

gcstackbarrierall=1 is not necessary to reproduce this failure, which means it's not just something funny that debug mode is doing:

go test -c && stress sh -c "GODEBUG=gcstackbarrierall=1,gcstackbarrierall=0 GOGC=1 ./pprof.test -test.run TestStackBarrierProfiling"

(The weird GODEBUG is to trick the test into not re-exec'ing itself.) Turning off this debug mode does appear to decrease the failure probability by 2–3X, though.
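
A hypothetical illustration of why the repeated key works, assuming GODEBUG is parsed left to right with later settings overwriting earlier ones while the test only does a substring check for gcstackbarrierall=1:

package main

import (
	"fmt"
	"strings"
)

// parseGodebug is a stand-in for the runtime's GODEBUG parsing, written
// here only to illustrate the assumed last-one-wins behavior.
func parseGodebug(s string) map[string]string {
	settings := map[string]string{}
	for _, field := range strings.Split(s, ",") {
		if name, value, ok := strings.Cut(field, "="); ok {
			settings[name] = value // a later duplicate overwrites
		}
	}
	return settings
}

func main() {
	godebug := "gcstackbarrierall=1,gcstackbarrierall=0"
	fmt.Println(parseGodebug(godebug)["gcstackbarrierall"])       // 0: debug mode effectively off
	fmt.Println(strings.Contains(godebug, "gcstackbarrierall=1")) // true: the test skips re-exec'ing
}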

@gopherbot (Contributor)

CL https://golang.org/cl/18761 mentions this issue.

@aclements (Member Author)

I can still reproduce this on master (df2a9e4) with 4 failures out of 2617 runs (with the 500 Hz tweak). Now running a stress test with CL 18761 applied.

@aclements (Member Author)

Over 10,000 runs with CL 18761 applied, I saw no failures obviously related to this bug. I did get 7 timeouts; I'm now trying to make sure those weren't introduced by the CL.

@aclements (Member Author)

The timeouts were almost certainly caused by the CL. The scheduler is getting stuck trying to cas a _Gdead mark worker from _Gwaiting to _Grunnable.
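
A self-contained illustration (not the runtime's code) of how that kind of hang looks: a CAS loop that expects _Gwaiting can never succeed once the worker has already been put in _Gdead.

package main

import (
	"fmt"
	"sync/atomic"
)

// Illustrative goroutine status values for this sketch only.
const (
	statusRunnable uint32 = 1
	statusWaiting  uint32 = 4
	statusDead     uint32 = 6
)

// tryReady spins attempting to move a goroutine from waiting to runnable,
// roughly the shape of the CAS the scheduler is stuck in. With a bounded
// spin it gives up; the real loop would spin forever.
func tryReady(status *uint32, maxSpins int) bool {
	for i := 0; i < maxSpins; i++ {
		if atomic.CompareAndSwapUint32(status, statusWaiting, statusRunnable) {
			return true
		}
	}
	return false
}

func main() {
	status := statusDead                             // a mark worker already marked dead
	fmt.Println("readied:", tryReady(&status, 1000)) // false: the CAS never succeeds
}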

golang locked and limited conversation to collaborators Feb 3, 2017