
runtime: fpTracebackPartialExpand SIGSEGV under high panic load #73664


Open
john-markham opened this issue May 10, 2025 · 1 comment
Labels
BugReport Issues describing a possible bug in the Go implementation. compiler/runtime Issues related to the Go compiler and/or runtime.


john-markham commented May 10, 2025

Go version

go version go1.23.8 linux/arm64 (also happens on go version go1.23.8 linux/amd64)

Output of go env in your module/workspace:

Note: this happened during a production incident, where I have little ability to run go env. I've included the output from my local machine instead, since we configure our Go environment variables similarly between local and deployed environments. Sorry, this is the best I can do with what I have currently.

GO111MODULE=''
GOARCH='arm64'
GOBIN='/Users/johnmarkham/go/bin'
GOCACHE='/Users/johnmarkham/Library/Caches/go-build'
GOENV='/Users/johnmarkham/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/johnmarkham/go/pkg/mod'
GONOPROXY='...' [internal company gomodules proxy]
GONOSUMDB='...' [internal company gomodules proxy]
GOOS='darwin'
GOPATH='/Users/johnmarkham/go'
GOPRIVATE='...' [internal company gomodules proxy]
GOPROXY='...' [internal company gomodules proxy]
GOROOT='/usr/local/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/local/go/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='go1.23.8'
GODEBUG=''
GOTELEMETRY='local'
GOTELEMETRYDIR='/Users/johnmarkham/Library/Application Support/go/telemetry'
GCCGO='gccgo'
GOARM64='v8.0'
AR='ar'
CC='clang'
CXX='clang++'
CGO_ENABLED='1'
GOMOD='/Users/johnmarkham/Desktop/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/h8/jfwwz6r1429168qq4sb24pkr0000gn/T/go-build1078014855=/tmp/go-build -gno-record-gcc-switches -fno-common'

What did you do?

I'll explain this as if walking up the stack trace included below.

In our GraphQL service, which is based on gqlgen, one of our resolvers started panicking due to a nil pointer dereference:

func (r *latestAssetPriceResolver) PercentChanges(ctx context.Context, obj *model.LatestAssetPrice) (*model.PercentChanges, error) {
	// Fetch data from upstream; resp was nil during the incident.
	resp, err := ...
	...
	parsedHour, err := strconv.ParseFloat(resp.PercentChanges.Hour, 64) // nil pointer dereference
	...
}

A deferred function we use for span collection ran immediately after runtime.panicmem():

func (t *Tracer) InterceptField(ctx context.Context, next graphql.Resolver) (interface{}, error) {
	...
	// Start a new Datadog tracing span for this resolver.
	ddSpan, newCtx := ddtracer.StartSpanFromContext(ctx, TraceOperationName, ddtracer.ResourceName(resolverName))
	...
	defer func() {
		operationName := getOperationNameFromContext(ctx)
		ddSpan.SetTag("resolver.operation", operationName)
		...
		ddSpan.Finish(ddtracer.WithError(err)) // the SIGSEGV occurs inside this call
	}()
	...
}

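The Datadog library then began finishing the span associated with the resolver.

While doing so it tried to acquire a lock (the sync.RWMutex taken in (*trace).finishedOne). The lock was contended, so the runtime recorded a blocking event via runtime.blockevent and runtime.saveblockevent, which walks the frame pointers in runtime.fpTracebackPartialExpand. That frame walk triggered the SIGSEGV.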
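The sequence above can be sketched as a small program. This is only an illustration of the code path, not a confirmed reproducer: on a healthy runtime it recovers cleanly. The presence of runtime.blockevent in the trace suggests block profiling was active in the crashing process, so the sketch turns it on with runtime.SetBlockProfileRate. The function name finishUnderContention and the struct are hypothetical stand-ins for the resolver and the tracing extension.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// sink keeps the compiler from treating the field load as dead code.
var sink string

// finishUnderContention (hypothetical name) panics with a nil pointer
// dereference and, while the panic is unwinding, runs a deferred function
// that blocks on a contended RWMutex. With block profiling enabled, that
// blocking wait goes through the runtime's block-event recording path.
func finishUnderContention() (recovered any) {
	// Assumption: block profiling was enabled in the crashing process.
	runtime.SetBlockProfileRate(1)

	var mu sync.RWMutex
	mu.Lock() // held by "someone else" for a short while
	go func() {
		time.Sleep(10 * time.Millisecond)
		mu.Unlock()
	}()

	// Registered first, so it runs last: stands in for panic recovery.
	defer func() { recovered = recover() }()

	// Registered second, so it runs first, during the panic: stands in for
	// the span-finishing code that takes the lock.
	defer func() {
		mu.Lock() // contended: parks until the goroutine above unlocks
		mu.Unlock()
	}()

	var p *struct{ Hour string }
	sink = p.Hour // nil pointer dereference, as in the resolver
	return nil
}

func main() {
	r := finishUnderContention()
	fmt.Println("recovered a panic:", r != nil)
}
```

Running many of these concurrently in the affected environment might help narrow down whether the deferred lock acquisition during a panic is a necessary ingredient.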

For convenience, here are links corresponding to each call in this sequence, starting from the DD library:

https://github.com/DataDog/dd-trace-go/blob/v1.999.0-rc.27/ddtrace/internal/v2.go#L157
https://github.com/DataDog/dd-trace-go/blob/v2.1.0-dev.1/ddtrace/tracer/span.go#L664
https://github.com/DataDog/dd-trace-go/blob/v2.1.0-dev.1/ddtrace/tracer/span.go#L730
https://github.com/DataDog/dd-trace-go/blob/v2.1.0-dev.1/ddtrace/tracer/spancontext.go#L301
https://github.com/DataDog/dd-trace-go/blob/v2.1.0-dev.1/ddtrace/tracer/spancontext.go#L499
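https://cs.opensource.google/go/go/+/refs/tags/go1.24.3:src/internal/sync/mutex.go;l=149 (strangely, this file does not seem to exist at go1.23.8, so the deployed binary may have been built with a newer toolchain than the one reported above; in any case, the function called is runtime_SemacquireMutex)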
https://github.com/golang/go/blob/go1.23.8/src/runtime/sema.go#L95
https://github.com/golang/go/blob/go1.23.8/src/runtime/sema.go#L194
https://github.com/golang/go/blob/go1.23.8/src/runtime/mprof.go#L513
https://github.com/golang/go/blob/go1.23.8/src/runtime/mprof.go#L563
https://github.com/golang/go/blob/go1.23.8/src/runtime/mprof.go#L592
https://cs.opensource.google/go/go/+/master:src/runtime/signal_unix.go;l=909?q=signal_unix.go:909&sq=&ss=go%2Fgo

Interestingly, the crash appears to be non-deterministic. Some requests hit the same nil pointer panic and still reached our panic recovery mechanisms. Others hit the SIGSEGV and crashed our containers.
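The difference between the two outcomes is that an ordinary nil pointer dereference in user code becomes a panic that recover can intercept, whereas the "fatal error: unexpected signal during runtime execution" shown in the trace is raised through runtime.throw and cannot be recovered. A minimal sketch of the recoverable case follows; the response and percentChanges types and the resolve function are hypothetical stand-ins, not our actual code.

```go
package main

import (
	"fmt"
	"runtime"
)

// Hypothetical stand-ins for the upstream response types.
type percentChanges struct{ Hour string }
type response struct{ PercentChanges *percentChanges }

// resolve dereferences resp without a nil check and converts any resulting
// panic into an error, the way a recovery middleware would.
func resolve(resp *response) (hour string, err error) {
	defer func() {
		if r := recover(); r != nil {
			// Nil dereferences surface as a runtime.Error panic value.
			if re, ok := r.(runtime.Error); ok {
				err = fmt.Errorf("recovered runtime error: %w", re)
				return
			}
			err = fmt.Errorf("recovered panic: %v", r)
		}
	}()
	return resp.PercentChanges.Hour, nil
}

func main() {
	_, err := resolve(nil) // resp was nil during the incident
	fmt.Println("error returned instead of crashing:", err != nil)

	hour, err := resolve(&response{PercentChanges: &percentChanges{Hour: "1.5"}})
	fmt.Println(hour, err)
}
```

A fault inside the runtime itself, such as the one in fpTracebackPartialExpand, never reaches a deferred recover like this, which is why those requests took the whole container down.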

We did see #69629, which on the surface seems closely related, if not the same issue. Unfortunately, the go-metro frame pointer (FP) clobbering described there does not apply here, as none of our code invokes go-metro at runtime. We have not found any other potentially misbehaving assembly that could clobber the FP.
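To make concrete why a clobbered FP would crash a frame-pointer walk, here is a toy model. It is not the runtime's code: memory is simulated with a map, and the names memory, healthyStack, and walk are invented for illustration. It assumes the usual layout where each frame stores the caller's FP at [fp] and the return address at [fp+8].

```go
package main

import "fmt"

// memory is a toy address space: address -> 8-byte word stored there.
// An address missing from the map models unmapped memory.
type memory map[uint64]uint64

// healthyStack builds three linked frames at 0x100, 0x200 and 0x300.
// The outermost frame has a saved FP of 0, which ends the walk.
func healthyStack() memory {
	return memory{
		0x100: 0x200, 0x108: 0xA, // frame 1: saved FP, return PC
		0x200: 0x300, 0x208: 0xB, // frame 2
		0x300: 0x000, 0x308: 0xC, // frame 3 (outermost)
	}
}

// walk follows the chain of saved frame pointers starting at fp and collects
// return addresses. If it touches an unmapped address it stops and reports
// that address, which is where a real walk would take a SIGSEGV.
func walk(mem memory, fp uint64, max int) (pcs []uint64, fault uint64, ok bool) {
	for len(pcs) < max && fp != 0 {
		pc, mapped := mem[fp+8]
		if !mapped {
			return pcs, fp + 8, false
		}
		pcs = append(pcs, pc)

		next, mapped := mem[fp]
		if !mapped {
			return pcs, fp, false
		}
		fp = next
	}
	return pcs, 0, true
}

func main() {
	mem := healthyStack()
	pcs, _, ok := walk(mem, 0x100, 16)
	fmt.Printf("healthy: pcs=%x ok=%v\n", pcs, ok)

	// Something overwrites the saved FP in frame 2 with a garbage value.
	mem[0x200] = 0xdead0000
	pcs, fault, ok := walk(mem, 0x100, 16)
	fmt.Printf("clobbered: pcs=%x fault=%#x ok=%v\n", pcs, fault, ok)
}
```

The walk trusts every saved FP it reads, so a single bad value anywhere in the chain sends it to an arbitrary address. That would fit the faulting address in the trace below not looking like a normal stack address, though we have not identified what wrote the bad value.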

We can reproduce this in our dev environment, but unfortunately not locally. Let me know what further information would be helpful.

Tagging @nsrip-dd as he seems to have extensive expertise in issues similar or related to this (e.g. #61766)

What did you see happen?

fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x1004017497340 pc=0x4629c]

goroutine 5509 gp=0x401a5e2700 m=6 mp=0x4004834008 [running]:
runtime.throw({0xa025983?, 0x18ec0?})
	/usr/local/go/src/runtime/panic.go:1101 +0x38 fp=0x4019490a60 sp=0x4019490a30 pc=0x89078
runtime.sigpanic()
	/usr/local/go/src/runtime/signal_unix.go:909 +0x36c fp=0x4019490ac0 sp=0x4019490a60 pc=0x8b8bc
runtime.fpTracebackPartialExpand(0x0?, 0x0?, {0x400339b200, 0x87, 0x0?})
	/usr/local/go/src/runtime/mprof.go:592 +0x3c fp=0x4019490b40 sp=0x4019490ad0 pc=0x4629c
runtime.saveblockevent(0x79df4c0, 0x5f5e0ff, 0x6, 0x2)
	/usr/local/go/src/runtime/mprof.go:563 +0x180 fp=0x4019490b90 sp=0x4019490b40 pc=0x46170
runtime.blockevent(0x22c2116f61b?, 0x5)
	/usr/local/go/src/runtime/mprof.go:513 +0x5c fp=0x4019490bd0 sp=0x4019490b90 pc=0x87b4c
runtime.semacquire1(0x400da76704, 0x0, 0x3, 0x2, 0x15)
	/usr/local/go/src/runtime/sema.go:194 +0x2e4 fp=0x4019490c20 sp=0x4019490bd0 pc=0x679c4
internal/sync.runtime_SemacquireMutex(0x8?, 0xa0?, 0x4017496c01?)
	/usr/local/go/src/runtime/sema.go:95 +0x28 fp=0x4019490c60 sp=0x4019490c20 pc=0x8acc8
internal/sync.(*Mutex).lockSlow(0x400da76700)
	/usr/local/go/src/internal/sync/mutex.go:149 +0x170 fp=0x4019490cb0 sp=0x4019490c60 pc=0x9adc0
internal/sync.(*Mutex).Lock(...)
	/usr/local/go/src/internal/sync/mutex.go:70
sync.(*Mutex).Lock(...)
	/usr/local/go/src/sync/mutex.go:46
sync.(*RWMutex).Lock(0x400da76700)
	/usr/local/go/src/sync/rwmutex.go:150 +0x74 fp=0x4019490ce0 sp=0x4019490cb0 pc=0x9cd34
github.com/DataDog/dd-trace-go/v2/ddtrace/tracer.(*trace).finishedOne(0x400da76700, 0x4017745680)
	/go/pkg/mod/github.com/!data!dog/dd-trace-go/v2@v2.1.0-dev.1/ddtrace/tracer/spancontext.go:499 +0x38 fp=0x4019490eb0 sp=0x4019490ce0 pc=0x1096d88
github.com/DataDog/dd-trace-go/v2/ddtrace/tracer.(*SpanContext).finish(...)
	/go/pkg/mod/github.com/!data!dog/dd-trace-go/v2@v2.1.0-dev.1/ddtrace/tracer/spancontext.go:301
github.com/DataDog/dd-trace-go/v2/ddtrace/tracer.(*Span).finish(0x4017745680, 0x183d9e0d6c083714)
	/go/pkg/mod/github.com/!data!dog/dd-trace-go/v2@v2.1.0-dev.1/ddtrace/tracer/span.go:730 +0x2f4 fp=0x4019490f80 sp=0x4019490eb0 pc=0x108ae44
github.com/DataDog/dd-trace-go/v2/ddtrace/tracer.(*Span).Finish(0x4017745680, {0x4017497050, 0x1, 0x2b5eb28?})
	/go/pkg/mod/github.com/!data!dog/dd-trace-go/v2@v2.1.0-dev.1/ddtrace/tracer/span.go:664 +0x2dc fp=0x4019491000 sp=0x4019490f80 pc=0x108a83c
gopkg.in/DataDog/dd-trace-go.v1/ddtrace/internal.SpanV2Adapter.Finish({0x40047565b9?}, {0x400c93fe00?, 0x0?, 0x85b9000?})
	/go/pkg/mod/gopkg.in/!data!dog/dd-trace-go.v1@v1.999.0-rc.27/ddtrace/internal/v2.go:158 +0x50 fp=0x4019491060 sp=0x4019491000 pc=0x10bfa70
[redacted - internal company code: tracing extension]
fp=0x40194910e0 sp=0x4019491060 pc=0x2b5ebd0
panic({0x89a2d20?, 0x123f0aa0?})
	/usr/local/go/src/runtime/panic.go:792 +0x124 fp=0x4019491190 sp=0x40194910e0 pc=0x88d04
runtime.panicmem(...)
	/usr/local/go/src/runtime/panic.go:262
runtime.sigpanic()
[redacted - internal company code: graphql resolver]
	/home/runner/_work/central-builder/central-builder/internal/graph/generated/asset.generated.go:24671 +0x30 fp=0x4019491f60 sp=0x4019491f30 pc=0x3e6c8e0
github.com/99designs/gqlgen/graphql.(*FieldSet).Dispatch.func1({0x1?, 0x401943b578?})
	/go/pkg/mod/github.com/99designs/[email protected]/graphql/fieldset.go:50 +0x6c fp=0x4019491fb0 sp=0x4019491f60 pc=0x2b1e83c
github.com/99designs/gqlgen/graphql.(*FieldSet).Dispatch.gowrap1()
	/go/pkg/mod/github.com/99designs/[email protected]/graphql/fieldset.go:51 +0x34 fp=0x4019491fd0 sp=0x4019491fb0 pc=0x2b1e794
runtime.goexit({})
	/usr/local/go/src/runtime/asm_arm64.s:1223 +0x4 fp=0x4019491fd0 sp=0x4019491fd0 pc=0x91a84
created by github.com/99designs/gqlgen/graphql.(*FieldSet).Dispatch in goroutine 5388
	/go/pkg/mod/github.com/99designs/[email protected]/graphql/fieldset.go:48 +0x110
...

What did you expect to see?

No SIGSEGV.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label May 10, 2025
@gabyhelp gabyhelp added the BugReport Issues describing a possible bug in the Go implementation. label May 10, 2025