-
Notifications
You must be signed in to change notification settings - Fork 18k
runtime: incorrect frame information in traceback traversal may hang the process. #50772
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Thanks for the report. Could you explain what function's frame size that you think is wrong? |
I'm not sure I understand this part: You said from the "correct" stack trace
which indicates the SP delta should be 0xffc41457b5b0-0xffc41457b560=0x50. But also
The SP delta seems to match above and therefore should be the correct value. Why does it get a different PC instead, which appears to be the memory content of 0xffc41457b5d0 = SP+0x70 ? |
Thanks for the quick response. Sorry, I messed up the first line of the SP and vDSO SP stack traces, I have fixed them now :/
I think |
Here is the disassembly (note that PC=0x48108d actually means PC=0x48108c, there's an off by one offset), some instructions before the stack frame is created:
|
Sorry, I still don't quite understand it. Maybe there are still some numbers swapped? You said
and also
which seems SP=0xffc41457b560 whereas VDSO SP=0xffc41457b580, which is the other way around compared to the first line. That said, I may have a guess what the problem is. As you mentioned, we do a traceback from SP first, if this fails, then do a traceback from VDSO SP. But at certain point in the VDSO calling code ( |
Is it hanging in the first call to gentraceback (with PC, SP from the signal handler), or the second (with VDSO PC and SP)? Also, could you disassemble the first a few instructions of |
I think the VDSO SP is wrong, and it is fixed in https://go-review.googlesource.com/c/go/+/337590 . But that CL is not applied to Go 1.17. Maybe we should. Thanks. |
You are absolutely right, sorry 🤦 It's SP (0xffc41457b560) and vDSO SP (0xffc41457b580), fixed 👍 |
I'ts hanging in the second (VDSO PC and SP).
Right, the SP delta 0x50 matches the machine code, but somehow a delta of 0x30 would be needed in the case of VDSO for some reason. Here's the disassembly:
|
That is because VDSO SP is wrong. See my comment at #50772 (comment) |
I think CL https://go-review.googlesource.com/c/go/+/337590 should fix this. As this is fixed at tip, I think we can close this and open backport issues for previous releases. In the meantime, you can try Go 1.18beta1 ( https://go.dev/dl/#go1.18beta1 ) or manually apply that patch to your toolchain, and let me know if this it still fails. Thanks. |
@gopherbot please backport this to previous releases. This may cause programs to hang. Thanks. |
Backport issue(s) opened: #50780 (for 1.16), #50781 (for 1.17). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases. |
Change https://golang.org/cl/380715 mentions this issue: |
Change https://golang.org/cl/380716 mentions this issue: |
Confirming that backporting this individual change into our go1.17.6 runtime fixes the process hangs we were observing. |
Thanks for confirming! @lizthegrey |
m.vdsoSP should be set to the SP of the caller of nanotime1, instead of the SP of nanotime1 itself, which matches m.vdsoPC. Otherwise the unmatched vdsoPC and vdsoSP would make the stack trace look like recursive. We already do it correctly on AMD64, 386, and RISCV64. This CL fixes the rest. Also incorporate CL 352509, skipping a flaky test. Updates #47324, #50772. Fixes #50780. Change-Id: I98b6fcfbe9fc6bdd28b8fe2a1299b7c505371dd4 Reviewed-on: https://go-review.googlesource.com/c/go/+/337590 Trust: Cherry Mui <[email protected]> Trust: Josh Bleecher Snyder <[email protected]> Reviewed-by: Josh Bleecher Snyder <[email protected]> (cherry picked from commit 217507e) Reviewed-on: https://go-review.googlesource.com/c/go/+/380716 Run-TryBot: Cherry Mui <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: Michael Knyszek <[email protected]>
m.vdsoSP should be set to the SP of the caller of nanotime1, instead of the SP of nanotime1 itself, which matches m.vdsoPC. Otherwise the unmatched vdsoPC and vdsoSP would make the stack trace look like recursive. We already do it correctly on AMD64, 386, and RISCV64. This CL fixes the rest. Also incorporate CL 352509, skipping a flaky test. Updates #47324, #50772. Fixes #50781. Change-Id: I98b6fcfbe9fc6bdd28b8fe2a1299b7c505371dd4 Reviewed-on: https://go-review.googlesource.com/c/go/+/337590 Trust: Cherry Mui <[email protected]> Trust: Josh Bleecher Snyder <[email protected]> Reviewed-by: Josh Bleecher Snyder <[email protected]> (cherry picked from commit 217507e) Reviewed-on: https://go-review.googlesource.com/c/go/+/380715 Run-TryBot: Cherry Mui <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: Michael Knyszek <[email protected]>
What version of Go are you using (
go version
)?go version go1.17.0 linux/arm64
Does this issue reproduce with the latest release?
It's been also been reproduced with
go version go1.17.6 linux/arm64
.What operating system and processor architecture are you using (
go env
)?Linux ARM64 (it could affect other architectures where
sys.MinFrameSize > 0
).What did you do?
Run a go program with profiling enabled on Linux/ARM64.
This may eventually (it may take from minutes to days) cause the process to hang.
I have a binary and a core dump that showcases the problem (I can share those but they're rather big -the core dump is 4.2GB-) but we have been unable to reproduce it since with this binary.
We have also been able to reproduce in production with private code on linux/arm64 with version go1.17.6 (more on this later).
We have tried hard and failed to generate a small reproducible program.
What did you expect to see?
The program should generate the stack traces and keep running normally.
What did you see instead?
The program hangs. This seems to be caused by a combination of an infinite loop in
runtime.gentraceback
function while there's astopTheWorld
happening. As a result, the single running goroutine keeps running forever.Here's the backtrace when the problem occurs:
Inside
sigprof
gentraceback
is called twice, the first time with the PC and SP arguments insigprof
which fails, the second time with the vSDO PC and SP.This seems to be the problem. If we take a look into the stack at the time:
We can see how the FP in both SP (0xffc41457b560) and vDSO SP (0xffc41457b580) point to the same next frame (0x0000ffc41457b5a8).
Following the FP would result in the correct stack trace. From SP:
PC=0x48108d (runtime.call2097152), SP=0xffc41457b560
PC=0x454178 (runtime.findrunnable), SP=0xffc41457b5b0
PC=0x455dec (runtime.schedule), SP=0xffc41457b600
PC=0x456540 (runtime.goschedImpl), SP=0xffc41457b6f0
PC=0x4566dc (runtime.gosched_m), SP=0xffc41457b750
PC=0x47fbb4 (runtime.mcall), SP=0xffc41457b790
From vDSO SP:
PC=0x456078 (runtime.checkTimers), SP=0xffc41457b580
PC=0x454178 (runtime.findrunnable), SP=0xffc41457b5b0
PC=0x455dec (runtime.schedule), SP=0xffc41457b600
PC=0x456540 (runtime.goschedImpl), SP=0xffc41457b6f0
PC=0x4566dc (runtime.gosched_m), SP=0xffc41457b750
PC=0x47fbb4 (runtime.mcall), SP=0xffc41457b790
Instead,
funcspdelta
is used which relays onpcsp
to retrieve the stack frame size.For PC=0x456078 this delta is 0x50, which is used to jump to PC=0x41f424 (
runtime.notewakeup
). I believe this is incorrect and the root cause of the bug since this results in an incorrect stack frame.Looking at
gentraceback
'spcbuf
contents we can see the following information:I'm unsure why
runtime.checkTimers
appears 4 times beforeruntime.notewakeup
, but it seems to confirm the stack trace corruption.After
runtime.notewakeup
the next frame has PC=0x48108d (runtime.call2097152
).At this point,
funcspdelta
returns 0, so in the traceback traversal SP = FP from this point forward (in x86, AMD64 and WASM this FP is increased just afterward, and thus the FP = SP invariant doesn't hold there and there's no infinite loop in these architectures).Additionally, when the frame being traversed is the frame of a wrapper function (
funcID == funcID_wrapper
), as it's in the case we are analyzing, the loop indexn
doesn't get incremented. Combining these two (SP = FP, n not incremented) results in an infinite loop.This bug has also happened in a production instance with go version 1.17.6 (also on linux/ARM64). We modified the runtime to:
With these changes we have been able to confirm the pattern described above (but failed to create a small reproducible case):
The text was updated successfully, but these errors were encountered: