Skip to content

runtime: incorrect frame information in traceback traversal may hang the process. #50772

Closed
@abeaumont

Description

@abeaumont

What version of Go are you using (go version)?

go version go1.17.0 linux/arm64

Does this issue reproduce with the latest release?

It's been also been reproduced with go version go1.17.6 linux/arm64.

What operating system and processor architecture are you using (go env)?

Linux ARM64 (it could affect other architectures where sys.MinFrameSize > 0).

What did you do?

Run a go program with profiling enabled on Linux/ARM64.
This may eventually (it may take from minutes to days) cause the process to hang.
I have a binary and a core dump that showcases the problem (I can share those but they're rather big -the core dump is 4.2GB-) but we have been unable to reproduce it since with this binary.
We have also been able to reproduce in production with private code on linux/arm64 with version go1.17.6 (more on this later).
We have tried hard and failed to generate a small reproducible program.

What did you expect to see?

The program should generate the stack traces and keep running normally.

What did you see instead?

The program hangs. This seems to be caused by a combination of an infinite loop in runtime.gentraceback function while there's a stopTheWorld happening. As a result, the single running goroutine keeps running forever.

Here's the backtrace when the problem occurs:

gef➤  bt
#0  0x000000000046c52c in runtime.pcvalue (f=..., off=0x20830, targetpc=0x48108d, cache=0x401880e6f8, strict=0x1, ~r5=<optimized out>, ~r6=<optimized out>) at runtime/symtab.go:815
#1  0x000000000046ca60 in runtime.funcspdelta (f=..., targetpc=0x48108d, cache=0x401880e6f8, ~r3=<optimized out>) at runtime/symtab.go:957
#2  0x0000000000473c5c in runtime.gentraceback (pc0=0x456078, sp0=0xffc41457b580, lr0=0x0, gp=0x4000602b60, skip=0x0, pcbuf=0x401880e958, max=0x40, callback={void (runtime.stkframe *, void *, bool *)} 0x401880e930, v=0x0, flags=0x6, ~r10=<optimized out>) at runtime/traceback.go:207
#3  0x000000000045941c in runtime.sigprof (pc=0xffffab3f84b0, sp=0xffc41457b560, lr=0x482f7c, gp=0x4000602b60, mp=0x4000580c00) at runtime/proc.go:4755
#4  0x00000000004648a0 in runtime.sighandler (sig=0x1b, info=0x401880eda0, ctxt=0x401880ee20, gp=0x4000602b60) at runtime/signal_unix.go:542
#5  0x0000000000464058 in runtime.sigtrampgo (sig=0x1b, info=0x401880eda0, ctx=0x401880ee20) at runtime/signal_unix.go:460
#6  0x0000000000483114 in runtime.sigtramp () at runtime/sys_linux_arm64.s:458
#7  <signal handler called>
#8  0x0000ffffab3f84b0 in __kernel_clock_gettime ()
#9  0x0000000000482f7c in runtime.nanotime1 () at runtime/sys_linux_arm64.s:350
#10 0x000000000048108d in runtime.call2097152 () at runtime/asm_arm64.s:431
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Inside sigprof gentraceback is called twice, the first time with the PC and SP arguments in sigprof which fails, the second time with the vSDO PC and SP.
This seems to be the problem. If we take a look into the stack at the time:

gef➤  x/40gx 0xffc41457b560-0x8
0xffc41457b558: 0x0000ffc41457b5a8      0x000000000048108d
0xffc41457b568: 0x000000003178c8b9      0x000000000045a8d8
0xffc41457b578: 0x0000ffc41457b5a8      0x0000000000456078
0xffc41457b588: 0x0000000000000000      0x0000000000000000
0xffc41457b598: 0xffffffffffffffff      0x0000000000000000
0xffc41457b5a8: 0x0000ffc41457b5f8      0x0000000000454178
0xffc41457b5b8: 0x0000ffc41457b5f8      0x000000000045423c
0xffc41457b5c8: 0x0000ffc41457b628      0x000000000041f424
0xffc41457b5d8: 0x0010c763826714a8      0x000000000045425c
0xffc41457b5e8: 0x0010c763825a0ab9      0x0000004000001040
0xffc41457b5f8: 0x0000ffc41457b6e8      0x0000000000455dec
0xffc41457b608: 0x0000004000044000      0x0000000000000000
0xffc41457b618: 0x0000004000001040      0x000000000048108d
0xffc41457b628: 0x000000003178c90c      0x0000000000453570
0xffc41457b638: 0x0000ffc41457b668      0x00000000004509d8
0xffc41457b648: 0x0000000000000000      0x000000000048108d
0xffc41457b658: 0x000000003178cb6b      0x0010c763826714a8
0xffc41457b668: 0x0000ffc41457b698      0x0000000000456078
0xffc41457b678: 0x0000000000000000      0x0000000000000000
0xffc41457b688: 0x0000000000000001      0x0000004000580c00

We can see how the FP in both SP (0xffc41457b560) and vDSO SP (0xffc41457b580) point to the same next frame (0x0000ffc41457b5a8).
Following the FP would result in the correct stack trace. From SP:
PC=0x48108d (runtime.call2097152), SP=0xffc41457b560
PC=0x454178 (runtime.findrunnable), SP=0xffc41457b5b0
PC=0x455dec (runtime.schedule), SP=0xffc41457b600
PC=0x456540 (runtime.goschedImpl), SP=0xffc41457b6f0
PC=0x4566dc (runtime.gosched_m), SP=0xffc41457b750
PC=0x47fbb4 (runtime.mcall), SP=0xffc41457b790

From vDSO SP:
PC=0x456078 (runtime.checkTimers), SP=0xffc41457b580
PC=0x454178 (runtime.findrunnable), SP=0xffc41457b5b0
PC=0x455dec (runtime.schedule), SP=0xffc41457b600
PC=0x456540 (runtime.goschedImpl), SP=0xffc41457b6f0
PC=0x4566dc (runtime.gosched_m), SP=0xffc41457b750
PC=0x47fbb4 (runtime.mcall), SP=0xffc41457b790

Instead, funcspdelta is used which relays on pcsp to retrieve the stack frame size.
For PC=0x456078 this delta is 0x50, which is used to jump to PC=0x41f424 (runtime.notewakeup). I believe this is incorrect and the root cause of the bug since this results in an incorrect stack frame.

Looking at gentraceback's pcbuf contents we can see the following information:

gef➤  x/8gx 0x401880e958
0x401880e958:   0x0000000000456079      0x0000000000456071
0x401880e968:   0x0000000000456078      0x0000000000456071
0x401880e978:   0x000000000041f424      0x0000000000000000
0x401880e988:   0x0000000000000000      0x0000000000000000

I'm unsure why runtime.checkTimers appears 4 times before runtime.notewakeup, but it seems to confirm the stack trace corruption.
After runtime.notewakeup the next frame has PC=0x48108d (runtime.call2097152).
At this point, funcspdelta returns 0, so in the traceback traversal SP = FP from this point forward (in x86, AMD64 and WASM this FP is increased just afterward, and thus the FP = SP invariant doesn't hold there and there's no infinite loop in these architectures).

Additionally, when the frame being traversed is the frame of a wrapper function (funcID == funcID_wrapper), as it's in the case we are analyzing, the loop index n doesn't get incremented. Combining these two (SP = FP, n not incremented) results in an infinite loop.

This bug has also happened in a production instance with go version 1.17.6 (also on linux/ARM64). We modified the runtime to:

  • print some debugging information when we enter the infinite loop.
  • break the loop to prevent the process from hanging.

With these changes we have been able to confirm the pattern described above (but failed to create a small reproducible case):

#0: pc=0x533e9, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#1: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#2: pc=0x533e8, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#3: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#4: pc=0x1b054, name=runtime.notewakeup, entry=0x1aff0, nameoff=0x11da, args=8, deferredreturn=0x0, pcsp=0x2e18, pcfile=0x2e23, pcln=0x2e26, npcdata=0x2, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#5: pc=0x228e00, name=crypto/x509.(*SignatureAlgorithm).String, entry=0x228de0, nameoff=0x28e68, args=24, deferredreturn=0x0, pcsp=0x21753, pcfile=0x22d, pcln=0x230, npcdata=0x2, cuOffset=0x9cd, funcID=0x16, flag=0x0, nfuncdata=0x6

#0: pc=0x533e9, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#1: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#2: pc=0x533e8, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#3: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#4: pc=0x1b054, name=runtime.notewakeup, entry=0x1aff0, nameoff=0x11da, args=8, deferredreturn=0x0, pcsp=0x2e18, pcfile=0x2e23, pcln=0x2e26, npcdata=0x2, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#5: pc=0x2291c5, name=crypto/x509.(*ConstraintViolationError).Error, entry=0x2291b0, nameoff=0x28f14, args=24, deferredreturn=0x0, pcsp=0x26e8c, pcfile=0x25e3, pcln=0xbb8, npcdata=0x2, cuOffset=0x9cd, funcID=0x16, flag=0x0, nfuncdata=0x6

#0: pc=0x533e9, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#1: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#2: pc=0x533e8, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#3: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#4: pc=0x1b054, name=runtime.notewakeup, entry=0x1aff0, nameoff=0x11da, args=8, deferredreturn=0x0, pcsp=0x2e18, pcfile=0x2e23, pcln=0x2e26, npcdata=0x2, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#5: pc=0x228eea, name=crypto/x509.(*PublicKeyAlgorithm).String, entry=0x228e70, nameoff=0x28e91, args=24, deferredreturn=0x0, pcsp=0x85325, pcfile=0x85332, pcln=0x8533f, npcdata=0x3, cuOffset=0x9cd, funcID=0x16, flag=0x0, nfuncdata=0x6

#0: pc=0x533e9, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#1: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#2: pc=0x533e8, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#3: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#4: pc=0x1b054, name=runtime.notewakeup, entry=0x1aff0, nameoff=0x11da, args=8, deferredreturn=0x0, pcsp=0x2e18, pcfile=0x2e23, pcln=0x2e26, npcdata=0x2, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#5: pc=0x22957f, name=crypto/x509.(*SystemRootsError).Unwrap, entry=0x229510, nameoff=0x28fda, args=24, deferredreturn=0x0, pcsp=0x21906, pcfile=0x25e3, pcln=0xbb8, npcdata=0x2, cuOffset=0x9cd, funcID=0x16, flag=0x0, nfuncdata=0x6

#0: pc=0x533e9, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#1: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#2: pc=0x533e8, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#3: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#4: pc=0x1b054, name=runtime.notewakeup, entry=0x1aff0, nameoff=0x11da, args=8, deferredreturn=0x0, pcsp=0x2e18, pcfile=0x2e23, pcln=0x2e26, npcdata=0x2, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#5: pc=0x2292d9, name=crypto/x509.(*HostnameError).Error, entry=0x2292c0, nameoff=0x28f65, args=24, deferredreturn=0x0, pcsp=0x2bcff, pcfile=0x208, pcln=0x20b, npcdata=0x2, cuOffset=0x9cd, funcID=0x16, flag=0x0, nfuncdata=0x6

#0: pc=0x533e9, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#1: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#2: pc=0x533e8, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#3: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#4: pc=0x1b054, name=runtime.notewakeup, entry=0x1aff0, nameoff=0x11da, args=8, deferredreturn=0x0, pcsp=0x2e18, pcfile=0x2e23, pcln=0x2e26, npcdata=0x2, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#5: pc=0x2291bc, name=crypto/x509.(*ConstraintViolationError).Error, entry=0x2291b0, nameoff=0x28f14, args=24, deferredreturn=0x0, pcsp=0x26e8c, pcfile=0x25e3, pcln=0xbb8, npcdata=0x2, cuOffset=0x9cd, funcID=0x16, flag=0x0, nfuncdata=0x6

#0: pc=0x533e9, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#1: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#2: pc=0x533e8, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#3: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#4: pc=0x1b054, name=runtime.notewakeup, entry=0x1aff0, nameoff=0x11da, args=8, deferredreturn=0x0, pcsp=0x2e18, pcfile=0x2e23, pcln=0x2e26, npcdata=0x2, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#5: pc=0x2296e7, name=crypto/x509.(*UnknownAuthorityError).Error, entry=0x2296b0, nameoff=0x2905c, args=24, deferredreturn=0x0, pcsp=0x217d9, pcfile=0x208, pcln=0x20b, npcdata=0x2, cuOffset=0x9cd, funcID=0x16, flag=0x0, nfuncdata=0x6

#0: pc=0x533e9, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#1: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#2: pc=0x533e8, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#3: pc=0x533e1, name=runtime.checkTimers, entry=0x53380, nameoff=0x5afb, args=40, deferredreturn=0x0, pcsp=0x144d7, pcfile=0x144ec, pcln=0x14500, npcdata=0x3, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#4: pc=0x1b054, name=runtime.notewakeup, entry=0x1aff0, nameoff=0x11da, args=8, deferredreturn=0x0, pcsp=0x2e18, pcfile=0x2e23, pcln=0x2e26, npcdata=0x2, cuOffset=0x15, funcID=0x0, flag=0x0, nfuncdata=0x6
#5: pc=0x228df9, name=crypto/x509.(*SignatureAlgorithm).String, entry=0x228de0, nameoff=0x28e68, args=24, deferredreturn=0x0, pcsp=0x21753, pcfile=0x22d, pcln=0x230, npcdata=0x2, cuOffset=0x9cd, funcID=0x16, flag=0x0, nfuncdata=0x6

Metadata

Metadata

Assignees

No one assigned

    Labels

    FrozenDueToAgeNeedsInvestigationSomeone must examine and confirm this is a valid issue and not a duplicate of an existing one.

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions