runtime: fpTracebackPartialExpand SIGSEGV under high panic load #73664
Labels
BugReport
compiler/runtime
Go version
go version go1.23.8 linux/arm64
(also happens on go version go1.23.8 linux/amd64)
Output of go env in your module/workspace:
Note: this happened during an incident in prod, where I have little ability to run go env. I've included the output from my local machine, as we configure our Go env vars similarly between local and our deployed environment. Sorry, this is the best I can do with what I have currently.
What did you do?
I'll explain this as if I'm walking up the stack trace included below:
In our GraphQL service, which is gqlgen-based, one of our resolvers started panicking due to a nil pointer dereference:
A deferred function we use for span collection ran immediately after runtime.panicmem():
Our Datadog library was attempting to begin "finish"ing the span associated with the resolver.
It attempted to acquire a lock belonging to the span, which somehow invoked a frame walk, which ended up triggering a SIGSEGV.
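To make the sequence above concrete, here is a minimal sketch of the shape of the code path, not our actual service code. The `span`, `user`, `resolve`, and `run` names are hypothetical stand-ins. It also assumes block profiling is enabled, which the report does not state outright; the mprof.go frames in the trace only suggest that the runtime was recording a blocking event. On a healthy build this program does not crash. It only shows a deferred function blocking on a contended mutex while a nil pointer panic is in flight.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// span is a hypothetical stand-in for a tracer span guarded by a mutex.
type span struct {
	mu       sync.Mutex
	finished bool
}

// finish takes the span lock, as a tracer's finish path might.
func (s *span) finish() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.finished = true
}

// user is a hypothetical value the resolver dereferences.
type user struct{ name string }

// resolve mimics a resolver that dereferences a nil pointer. Deferred calls
// run in reverse order, so s.finish runs first, while the panic is still in
// flight, and only then does the recover closure run.
func resolve(s *span, u *user) (name string, recovered any) {
	defer func() { recovered = recover() }()
	defer s.finish()
	return u.name, nil // panics when u is nil
}

// run sets up contention on the span lock so that finish has to block, which
// is when the runtime may record a blocking event (assumption: block
// profiling is on in the affected service).
func run() (recovered any, finished bool) {
	runtime.SetBlockProfileRate(1) // sample every blocking event
	defer runtime.SetBlockProfileRate(0)

	s := &span{}
	s.mu.Lock() // hold the lock so the deferred finish must wait
	go func() {
		time.Sleep(50 * time.Millisecond)
		s.mu.Unlock()
	}()

	_, recovered = resolve(s, nil)
	return recovered, s.finished
}

func main() {
	recovered, finished := run()
	fmt.Println("recovered:", recovered)
	fmt.Println("span finished:", finished)
}
```

In this sketch the panic is recovered normally. In the failing requests, the crash happened at the point where the deferred lock acquisition blocked and the runtime walked frame pointers.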
For convenience, here are the links that correspond to each call in this process, starting from the DD library:
https://github.com/DataDog/dd-trace-go/blob/v1.999.0-rc.27/ddtrace/internal/v2.go#L157
https://github.com/DataDog/dd-trace-go/blob/v2.1.0-dev.1/ddtrace/tracer/span.go#L664
https://github.com/DataDog/dd-trace-go/blob/v2.1.0-dev.1/ddtrace/tracer/span.go#L730
https://github.com/DataDog/dd-trace-go/blob/v2.1.0-dev.1/ddtrace/tracer/spancontext.go#L301
https://github.com/DataDog/dd-trace-go/blob/v2.1.0-dev.1/ddtrace/tracer/spancontext.go#L499
https://cs.opensource.google/go/go/+/refs/tags/go1.24.3:src/internal/sync/mutex.go;l=149 (hm, this didn't exist in Go 1.23.8? Strange. Anyway, the method called is runtime_SemacquireMutex)
https://github.com/golang/go/blob/go1.23.8/src/runtime/sema.go#L95
https://github.com/golang/go/blob/go1.23.8/src/runtime/sema.go#L194
https://github.com/golang/go/blob/go1.23.8/src/runtime/mprof.go#L513
https://github.com/golang/go/blob/go1.23.8/src/runtime/mprof.go#L563
https://github.com/golang/go/blob/go1.23.8/src/runtime/mprof.go#L592
https://cs.opensource.google/go/go/+/master:src/runtime/signal_unix.go;l=909?q=signal_unix.go:909&sq=&ss=go%2Fgo
Interestingly, this seems to have happened non-deterministically. Some requests were able to make it to our panic recovery mechanisms. Others, however, hit the SIGSEGV and crashed our containers.
We did see #69629, which seems on the surface to be highly related if not the same exact issue. But unfortunately the go-metro FP clobbering issue would not apply here, as go-metro was not invoked at runtime by any of our code. We don't see any other potential misbehaving assembly that could clobber the FP.
We are able to reproduce this in our dev environment, but sadly not locally. Let me know what further information would be helpful to provide.
Tagging @nsrip-dd as he seems to have extensive expertise in issues similar or related to this (e.g. #61766)
What did you see happen?
What did you expect to see?
No SIGSEGV.