runtime: nil pointer dereference in sigtrampgo #13363
Comments
I've confirmed that profiling signals are necessary to make this happen, so it's not simply an unrelated bug that happens to be tickled by this test.
CL https://golang.org/cl/17149 mentions this issue.
This improves the documentation comment on gcMarkDone, replaces a recursive call with a simple goto, and disables preemption before stopping the world in accordance with the documentation comment on stopTheWorldWithSema.
Updates #13363, but, sadly, doesn't fix it.
Change-Id: I6cb2a5836b35685bf82f7b1ce7e48a7625906656
Reviewed-on: https://go-review.googlesource.com/17149
Reviewed-by: Rick Hudson <[email protected]>
Run-TryBot: Austin Clements <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
No real progress, but I've tried a bunch of things.
First, it's still reproducible at current master (87d939d) with 5 failures out of 2933 runs.
It can be reproduced just as well without the extra fork/exec, which is useful for debugging:
I tried getting a branch trace record, but both techniques caused it to no longer reproduce even after >10,000 runs.
Using perf:
Using gdb:
It does appear to be sensitive to the profiling rate. I increased it from 100 Hz to 500 Hz and it got ~3X more reproducible.
stackbarrierall=1 is not necessary to reproduce this failure, which means it's not just something funny that debug mode is doing:
CL https://golang.org/cl/18761 mentions this issue.
I can still reproduce this on master (df2a9e4) with 4 failures out of 2617 runs (with the 500 Hz tweak). Now running a stress test with CL 18761 applied.
Over 10,000 runs with CL 18761 and no failures obviously related to this bug. I did get 7 timeouts. I'm trying now to make sure those weren't introduced by the CL.
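As a rough sanity check on why 10,000 clean runs is convincing, the sketch below estimates the chance of seeing no failures by luck alone if the bug were still present. It assumes runs are independent and that the per-run failure rate matches the 4-in-2617 figure above; the function name cleanRunProb is hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// cleanRunProb estimates the probability that n consecutive runs all pass,
// given an observed rate of `failures` out of `runs` and assuming each run
// fails independently with that same probability.
func cleanRunProb(failures, runs, n int) float64 {
	p := float64(failures) / float64(runs)
	return math.Pow(1-p, float64(n))
}

func main() {
	// 4 failures in 2617 runs before the CL; 10,000 clean runs after it.
	fmt.Printf("%.3g\n", cleanRunProb(4, 2617, 10000))
}
```

Under those assumptions the probability is far below one in a million, so the clean stretch is very unlikely to be a coincidence.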
The timeouts were almost certainly caused by the CL. The scheduler is getting stuck trying to cas a _Gdead mark worker from _Gwaiting to _Grunnable.
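To illustrate why such a transition can hang, here is a toy model, not the runtime's code. The status constants and the casStatus helper are hypothetical stand-ins, and the retry bound exists only so the example terminates. A compare-and-swap that expects _Gwaiting can never succeed once the status is _Gdead, so an unbounded retry loop spins forever.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Illustrative status values; the names echo the runtime's, the numbers do not.
const (
	gRunnable uint32 = iota + 1
	gWaiting
	gDead
)

// casStatus retries a compare-and-swap from old to next up to maxTries
// times and reports whether it ever succeeded. A loop without the bound
// would never return if the status is not (and never becomes) old.
func casStatus(status *uint32, old, next uint32, maxTries int) bool {
	for i := 0; i < maxTries; i++ {
		if atomic.CompareAndSwapUint32(status, old, next) {
			return true
		}
	}
	return false
}

func main() {
	waiting := gWaiting
	dead := gDead

	// A worker that really is waiting can be made runnable.
	fmt.Println(casStatus(&waiting, gWaiting, gRunnable, 1000)) // true

	// A dead worker never matches the expected old status.
	fmt.Println(casStatus(&dead, gWaiting, gRunnable, 1000)) // false
}
```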
While stress testing TestStackBarrierProfiling at 54bd5a7 on master, I got a segfault in sigtrampgo in signal_linux.go because g != nil, but g.m == nil.
I've saved the binary and core file. Here is some preliminary digging through the core:
This appears to be a nested signal. The original signal was a SIGSEGV in morestack at MOVQ m_g0(BX), SI because BX (getg().m) is 0. The signal handler then also crashed for the same reason. We clearly tried to grow the stack in gcFlushGCWork (the stack is very small because this test runs in gcstackbarrierall mode), but I'm not sure why there wasn't an M at that point. It may be related to the fact that we've stopped the world in gcMarkDone at that point.
This is relatively easy to reproduce. It happened five times out of 3,000 stress runs on my workstation (which took ~25 minutes).
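The failure shape described above (g is non-nil but g.m is nil) can be pictured with a toy model. The types and the g0For helper below are hypothetical simplifications, not the runtime's definitions. The point is only that checking g alone does not make g.m.g0 safe to read.

```go
package main

import "fmt"

// Toy stand-ins for the runtime's g and m structures.
type m struct{ g0 *g }
type g struct{ m *m }

// g0For returns the g0 of the M that gp is running on, or nil when gp is
// nil or has no M attached. Dereferencing gp.m.g0 after checking only
// gp != nil would fault in the second case.
func g0For(gp *g) *g {
	if gp == nil || gp.m == nil {
		return nil
	}
	return gp.m.g0
}

func main() {
	sched := &g{}
	attached := &g{m: &m{g0: sched}}
	orphan := &g{} // non-nil g with a nil m, as seen in the core file

	fmt.Println(g0For(attached) == sched) // true
	fmt.Println(g0For(orphan) == nil)     // true
}
```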