runtime: gentraceback() dead loop on arm64 caused the process to hang #52116
Comments
Can you print the information of each unwound stack frame (pc, lr, sp, fp)? I looked at the code and found two places that could cause this. The first is Line 187 in 4aa1efe, which sets frame.fp: if funcspdelta() returns 0, frame.fp ends up equal to frame.sp. The second is Line 227 in 4aa1efe, which sets frame.lr: it will then always be equal to frame.pc.
Thank you. |
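To make the suspicion above concrete, here is a minimal sketch of a single unwind step on a link-register machine. It is a simplified model, not the runtime's code: the frame type, the next helper, and the addresses are all made up for illustration. It only shows that when the SP delta is 0 and the LR that gets picked up equals the current PC, the "caller" frame is identical to the current one, so the walk cannot make progress.

```go
package main

import "fmt"

// frame is a hypothetical, simplified stand-in for the unwinder's frame state.
type frame struct{ pc, sp, fp, lr uintptr }

// next models one unwind step on a link-register machine:
// fp = sp + spDelta, lr falls back to savedLR when it is unset,
// and the caller frame starts at pc = lr, sp = fp.
func next(f frame, spDelta, savedLR uintptr) frame {
	f.fp = f.sp + spDelta
	if f.lr == 0 {
		f.lr = savedLR
	}
	return frame{pc: f.lr, sp: f.fp}
}

func main() {
	cur := frame{pc: 0x20, sp: 0xFB0}

	// spDelta == 0 and the loaded LR equals the current PC: no progress.
	stuck := next(cur, 0, 0x20)
	fmt.Println("stuck step returns same frame:", stuck == cur)

	// A non-zero delta and a different LR move the walk to a new frame.
	moved := next(cur, 32, 0x10)
	fmt.Println("normal step returns same frame:", moved == cur)
}
```

Printing pc, lr, sp and fp per frame, as requested above, would show directly whether two consecutive frames are identical in this way.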
I'm not sure how to get it. Iterate the From my previous observation, We've reinstalled the cluster to test something else, so the reproduction is not available now. |
Sorry for the long wait. I now have a reproduction, using 1.16.3. Here is how
and the value of the
|
The value in the
The call stack this time:
I don't know how to get the info for each stack frame; I would need to know the calling convention for the stack layout.
|
Could you print the function names at the PC and LR? You could probably do |
|
The last stack frame |
Change https://go.dev/cl/400575 mentions this issue: |
Could you try if CL https://go.dev/cl/400575 fixes it? Thanks. |
Now we've tried that CL and watched it for 19h, no reproducing yet. |
Several days have passed, and it still has not reproduced. |
Will this fix be cherry-picked to 1.16? @cherrymui |
Will this fix be cherry-picked to 1.18? And when will we get the first version with this issue fixed? @cherrymui |
On LR machine, consider F calling G calling H, which grows stack.
The stack looks like
...
G's frame:
    ... locals ...
    saved LR = return PC in F  <- SP points here at morestack
H's frame (to be created)

At morestack, we save
    gp.sched.pc = H's morestack call
    gp.sched.sp = H's entry SP (the arrow above)
    gp.sched.lr = return PC in G

Currently, when unwinding through morestack (if _TraceJumpStack is set), we switch PC and SP but not LR. We then have
    frame.pc = H's morestack call
    frame.sp = H's entry SP (the arrow above)
As LR is not set, we load it from stack at *sp, so
    frame.lr = return PC in F
As the SP hasn't decremented at the morestack call,
    frame.fp = frame.sp = H's entry SP

Unwinding a frame, we have
    frame.pc = old frame.lr = return PC in F
    frame.sp = old frame.fp = H's entry SP a.k.a. G's SP
The PC and SP don't match. The unwinding will go off if F and G have different frame sizes.

Fix this by preserving the LR when switching stack.

Also add code to detect infinite loop in unwinding.

TODO: add some test. I can reproduce the infinite loop (or throw with added check) but the frequency is low.

May fix golang#52116.

Change-Id: I6e1294f1c6e55f664c962767a1cf6c466a0c0eff
Reviewed-on: https://go-review.googlesource.com/c/go/+/400575
TryBot-Result: Gopher Robot
Run-TryBot: Cherry Mui
Reviewed-by: Eric Fang
Reviewed-by: Benny Siegert
We're also experiencing this issue and would appreciate a backport of this fix for Go 1.18. |
@gopherbot please backport this to previous releases. This is a runtime bug which can cause programs to hang or crash. Thanks. |
Backport issue(s) opened: #53111 (for 1.17), #53112 (for 1.18). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases. |
Change https://go.dev/cl/408821 mentions this issue: |
Change https://go.dev/cl/408822 mentions this issue: |
Thanks for finding this, we also have been encountering it on go1.18.2 and were stumped because we'd also found a similar hang in go1.17.6 (#50772) but knew that one had been fixed already. |
…morestack On LR machine, consider F calling G calling H, which grows stack. The stack looks like ... G's frame: ... locals ... saved LR = return PC in F <- SP points here at morestack H's frame (to be created) At morestack, we save gp.sched.pc = H's morestack call gp.sched.sp = H's entry SP (the arrow above) gp.sched.lr = return PC in G Currently, when unwinding through morestack (if _TraceJumpStack is set), we switch PC and SP but not LR. We then have frame.pc = H's morestack call frame.sp = H's entry SP (the arrow above) As LR is not set, we load it from stack at *sp, so frame.lr = return PC in F As the SP hasn't decremented at the morestack call, frame.fp = frame.sp = H's entry SP Unwinding a frame, we have frame.pc = old frame.lr = return PC in F frame.sp = old frame.fp = H's entry SP a.k.a. G's SP The PC and SP don't match. The unwinding will go off if F and G have different frame sizes. Fix this by preserving the LR when switching stack. Also add code to detect infinite loop in unwinding. TODO: add some test. I can reproduce the infinite loop (or throw with added check) but the frequency is low. Fixes #53111. Updates #52116. Change-Id: I6e1294f1c6e55f664c962767a1cf6c466a0c0eff Reviewed-on: https://go-review.googlesource.com/c/go/+/400575 TryBot-Result: Gopher Robot <[email protected]> Run-TryBot: Cherry Mui <[email protected]> Reviewed-by: Eric Fang <[email protected]> Reviewed-by: Benny Siegert <[email protected]> (cherry picked from commit 74f0009) Reviewed-on: https://go-review.googlesource.com/c/go/+/408822 Reviewed-by: Austin Clements <[email protected]>
…morestack On LR machine, consider F calling G calling H, which grows stack. The stack looks like ... G's frame: ... locals ... saved LR = return PC in F <- SP points here at morestack H's frame (to be created) At morestack, we save gp.sched.pc = H's morestack call gp.sched.sp = H's entry SP (the arrow above) gp.sched.lr = return PC in G Currently, when unwinding through morestack (if _TraceJumpStack is set), we switch PC and SP but not LR. We then have frame.pc = H's morestack call frame.sp = H's entry SP (the arrow above) As LR is not set, we load it from stack at *sp, so frame.lr = return PC in F As the SP hasn't decremented at the morestack call, frame.fp = frame.sp = H's entry SP Unwinding a frame, we have frame.pc = old frame.lr = return PC in F frame.sp = old frame.fp = H's entry SP a.k.a. G's SP The PC and SP don't match. The unwinding will go off if F and G have different frame sizes. Fix this by preserving the LR when switching stack. Also add code to detect infinite loop in unwinding. TODO: add some test. I can reproduce the infinite loop (or throw with added check) but the frequency is low. Fixes #53112. Updates #52116. Change-Id: I6e1294f1c6e55f664c962767a1cf6c466a0c0eff Reviewed-on: https://go-review.googlesource.com/c/go/+/400575 TryBot-Result: Gopher Robot <[email protected]> Run-TryBot: Cherry Mui <[email protected]> Reviewed-by: Eric Fang <[email protected]> Reviewed-by: Benny Siegert <[email protected]> (cherry picked from commit 74f0009) Reviewed-on: https://go-review.googlesource.com/c/go/+/408821 Reviewed-by: Austin Clements <[email protected]>
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes, it reproduces on the latest release, Go 1.18.
What operating system and processor architecture are you using (go env)?
go env Output
What did you do?
Ran a benchmark workload in our database, pingcap/tidb#31477.
Since we recently upgraded the toolchain to Go 1.18, it reproduces reliably.
It may take 6~8h, or at most 24h, before the tidb-server hangs.
When that happens, the tidb-server CPU usage is 100% (one thread fully occupied) and it stops serving requests.
I used gdb and dlv to debug it and found that one thread is stuck in a dead loop in the gentraceback() function (which also seems to be holding a lock), and all the other threads are in the futex() function (blocked on that lock).
I don't know how to reproduce it with a minimal code snippet. Maybe I can provide a core dump of the process, but that would be too large.
Some more information I can provide:
The phenomenon looks like another issue, #50772, but I've checked that the fix for it is included in Go 1.18, so this might be a new case.
More details:
The code doesn't enter this block:
go/src/runtime/traceback.go
Lines 357 to 377 in 4aa1efe
Then the code runs into this branch:
go/src/runtime/traceback.go
Lines 379 to 380 in 4aa1efe
funcID is funcID_wrapper.
Then line
go/src/runtime/traceback.go
Line 388 in 4aa1efe
and line
go/src/runtime/traceback.go
Line 458 in 4aa1efe
Note that after n-- and n++, the value of n never changes, so the for n < max condition can't break the loop.
This code block sets frame to its parent (caller) frame, but frame.fn and flr are the same one!
go/src/runtime/traceback.go
Lines 480 to 486 in 4aa1efe
Thus it results in a dead loop.
I can work around this bug with this patch: tiancaiamao@5d1aea4
But I still can't figure out the root cause of the bug.
Print the stack when debugging in dlv:
In gdb:
The stacks in dlv and gdb don't look exactly the same, although the last frame in both is in gentraceback (or one of its callees).
I tried to get the stack information from pcbuf to reason about the real stack; the mapping between pc addresses and symbols is just my guess.
Anything else I can provide to help you debug it?
What did you expect to see?
No dead loop in gentraceback() causing the server to hang.
What did you see instead?
gentraceback() dead loop on arm64 caused the process to hang.