
runtime: severe performance drop for cgo calls in go1.22.5 [1.23 backport] #69988


Closed
gopherbot opened this issue Oct 22, 2024 · 9 comments
Labels
CherryPickApproved Used during the release process for point releases compiler/runtime Issues related to the Go compiler and/or runtime.
Milestone

Comments

@gopherbot
Contributor

@aclements requested issue #68587 to be considered for backport to the next 1.23 minor release.

@gopherbot , please open a backport to 1.23.

@gopherbot gopherbot added the CherryPickCandidate Used during the release process for point releases label Oct 22, 2024
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Oct 22, 2024
@gopherbot gopherbot added this to the Go1.23.3 milestone Oct 22, 2024
@aclements
Member

This causes a dramatic drop in C -> Go call performance under some circumstances.

@Arup-Chauhan

Hi @aclements, I wanted to work on this; here is a potential workaround.

// Workaround: Use `runtime.LockOSThread()` to minimize thread switching overhead.
// This ensures the goroutine stays on the same OS thread during C -> Go calls.
runtime.LockOSThread()
defer runtime.UnlockOSThread()


Would this be viable? Feedback appreciated.

@arl
Contributor

arl commented Oct 26, 2024

@Arup-Chauhan the problem has already been worked on; the fix should be in the next Go release and is also being backported to Go 1.23. In fact, this is the backport issue.
See #68587

@Arup-Chauhan

Thank you for the update @arl, I will try to contribute to some other issue then.

@dr2chase dr2chase added the Critical A critical problem that affects the availability or correctness of production systems built using Go label Nov 6, 2024
@dmitshur
Member

dmitshur commented Nov 6, 2024

We moved the NextMinor label to the parent tracking issue #68587, removing here.

@dmitshur dmitshur removed the Critical A critical problem that affects the availability or correctness of production systems built using Go label Nov 6, 2024
@gopherbot gopherbot modified the milestones: Go1.23.3, Go1.23.4 Nov 6, 2024
@gopherbot gopherbot modified the milestones: Go1.23.4, Go1.23.5 Dec 3, 2024
@cagedmantis
Contributor

This is a severe performance regression. This change has been approved.

@cagedmantis cagedmantis added the CherryPickApproved Used during the release process for point releases label Dec 11, 2024
@gopherbot gopherbot removed the CherryPickCandidate Used during the release process for point releases label Dec 11, 2024
@cagedmantis
Contributor

A backport CL is required for this issue. (Messaging @cherrymui since she is the owner of the original CL).

@gopherbot
Contributor Author

Change https://go.dev/cl/635775 mentions this issue: [release-branch.go1.23] runtime: update and restore g0 stack bounds at cgocallback

gopherbot pushed a commit that referenced this issue Dec 13, 2024
[release-branch.go1.23] runtime: update and restore g0 stack bounds at cgocallback

Currently, at a cgo callback where there is already a Go frame on
the stack (i.e. C->Go->C->Go), we require that at the inner Go
callback the SP is within the g0's stack bounds set by a previous
callback. This is to prevent the C code from switching stacks while
having a Go frame on the stack, which we don't really support. But
this could also happen when we cannot get accurate stack bounds,
e.g. when pthread_getattr_np is not available. Since the stack
bounds are just estimates based on the current SP, if there are
multiple C->Go callbacks with various stack depth, it is possible
that the SP of a later callback falls out of a previous call's
estimate. This leads to a runtime throw in a seemingly reasonable
program.

This CL changes it to save the old g0 stack bounds at cgocallback,
update the bounds, and restore the old bounds at return. So each
callback will get its own stack bounds based on the current SP,
and when it returns, the outer callback has its old stack
bounds restored.

Also, at a cgo callback when there is no Go frame on the stack,
we currently always get new stack bounds. We do this because if
we can only get estimated bounds based on the SP, and the stack
depth varies a lot between two C->Go calls, the previous
estimates may be off and we fall out or nearly fall out of the
previous bounds. But this causes a performance problem: the
pthread API to get accurate stack bounds (pthread_getattr_np) is
very slow when called on the main thread. Getting the stack bounds
every time significantly slows down repeated C->Go calls on the
main thread.

This CL fixes it by "caching" the stack bounds if they are
accurate. I.e. at the second time Go calls into C, if the previous
stack bounds are accurate, and the current SP is in bounds, we can
be sure it is the same stack and we don't need to update the bounds.
This avoids the repeated calls to pthread_getattr_np. If we cannot
get the accurate bounds, we continue to update the stack bounds
based on the SP, and that operation is very cheap.

On a Linux/AMD64 machine with glibc:

name                     old time/op  new time/op  delta
CgoCallbackMainThread-8  96.4µs ± 3%   0.1µs ± 2%  -99.92%  (p=0.000 n=10+9)

Updates #68285.
Updates #68587.
Fixes #69988.

Change-Id: I3422badd5ad8ff63e1a733152d05fb7a44d5d435
Reviewed-on: https://go-review.googlesource.com/c/go/+/600296
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Michael Pratt <[email protected]>
(cherry picked from commit 76a8409)
Reviewed-on: https://go-review.googlesource.com/c/go/+/635775
@gopherbot
Contributor Author

Closed by merging CL 635775 (commit 59b7d40) to release-branch.go1.23.

yangjuncode pushed a commit to yangjuncode/go that referenced this issue Dec 27, 2024
[release-branch.go1.23] runtime: update and restore g0 stack bounds at cgocallback
mpminardi pushed a commit to tailscale/go that referenced this issue Jan 28, 2025
[release-branch.go1.23] runtime: update and restore g0 stack bounds at cgocallback

7 participants