runtime: epoll scalability problem with 192 core machine and 1k+ ready sockets #65064
There is a small chance that #56424 is related, though it seems unlikely as that was at a much smaller scale.
We're running on the
We don't have a reproducer for this problem right now, unfortunately, but our suspicion is that it should be easy to replicate by serving or making hundreds of thousands of fast network requests in a Go application using TCP.
We don't have a
We did not try increasing the buffer size; it wasn't apparent there was a way to do that without running a custom build of Go, and at the time running more than one container was a more accessible solution for us. Thanks for looking into this; it was definitely an interesting thing to find in the wild!
For some more context, the EpollWait time in the profile was 2800 seconds in a 30-second profile. Also, I don't necessarily think that the epoll buffer itself is the problem, but rather how epoll works under the hood with thousands of 'ready' sockets and hundreds of threads. The application under load had around 3500 open sockets: HTTP/2 clients making requests to our gRPC service on one end, and us making requests to ScyllaDB on the other.
Thanks for the details! I'll try to write a reproducer when I have some free time; not sure when I'll get to it.
Indeed, you'd need to manually modify the runtime. Note that it is possible to simply edit the runtime source in GOROOT and rebuild your program (no special steps are required for the runtime; it is treated like any other package). But if you build in a Docker container it is probably a pain to edit the runtime source.
Some thoughts from brainstorming, for posterity: my best theory at the moment (though I'd really like to see perf to confirm) is that ~90 threads are calling epoll_wait at once (probably at this non-blocking netpoll: https://cs.opensource.google/go/go/+/master:src/runtime/proc.go;l=3230;drc=dcbe77246922fe7ef41f07df228f47a37803f360). The kernel has a mutex around the entire copy-out portion of epoll_wait, so there is probably a lot of time spent waiting for that mutex. If that is the case, some form of rate-limiting on how many threads make the syscall at once may be effective. N.B. this non-blocking netpoll is not load-bearing for correctness, so occasionally skipping it would be OK.
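To make the rate-limiting idea concrete, here is a minimal sketch (not runtime code; `netpollGate` and `pollReadyFDs` are hypothetical stand-ins for what the scheduler would do): a single atomic flag lets only one thread at a time attempt the non-blocking poll, and everyone else skips it. The skip path is cheap precisely because, as noted above, this poll is an optimization rather than a correctness requirement.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// netpollGate is set while some thread is performing the non-blocking poll.
var netpollGate atomic.Bool

// tryNetpoll runs the (stand-in) non-blocking poll only if no other thread is
// currently doing so. Skipping is safe because this poll is an optimization,
// not a correctness requirement.
func tryNetpoll(pollReadyFDs func() []int) []int {
	if !netpollGate.CompareAndSwap(false, true) {
		return nil // another thread is already polling; skip
	}
	defer netpollGate.Store(false)
	return pollReadyFDs()
}

func main() {
	fakePoll := func() []int { return []int{3, 4} } // stand-in for epoll_wait(epfd, ..., 0)

	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			fmt.Println(tryNetpoll(fakePoll))
		}()
	}
	wg.Wait()
}
```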
Yeah, it was the netpoll call inside findRunnable (though I didn't have my source mapping set up at the time to confirm the exact line numbers). I've also got a spare test machine with the same CPU that I can use to try out a repro test case.
Is Go using the same epoll instance across all threads? That might be the underlying problem; most high-throughput applications (nginx, envoy, netty) create several instances (usually one per thread, together with an event loop), and connections get distributed to all epoll instances one way or another.
Good point! And to answer your question: yes, Go has been using a single (and global) epoll instance across all threads. From where I stand, I reckon that refactoring the current netpoll implementation to support multiple epoll instances is worth exploring. To sum up, multiple epoll instances may be the way to go here.
Using multiple epoll instances also raises the question of how connections get distributed among them.
This is one of the potential issues we may encounter and need to resolve if we decide to introduce multiple epoll instances. I actually drafted a WIP implementation of multiple epoll instances some time ago.
A casual observation (not Go specific): one reason epoll doesn't scale well when a single epoll instance is shared across threads is the file descriptor table, which is typically shared across the process. This is one of the reasons why, say, 8 separate processes usually perform better than a single process with 8 threads. The impact is present both with multiple epoll instances (one per thread) and with a single epoll instance shared across threads. The way to circumvent this is to unshare (via the unshare syscall) the file descriptor table across threads upon thread creation, then create an epoll instance per thread. This yields performance similar to a multi-process approach (within 1% in my experience). After that you can distribute the work however you want, for example with SO_REUSEPORT (see the sketch below). Be careful unsharing the file descriptor table, though; it is not appropriate for all situations. Side note: if you are sharing an epoll instance across threads, you should use edge triggering to avoid all threads being woken up, most of them unnecessarily. This is my experience anyway when using a thread-per-core model, although the principle applies regardless of the number of threads. I don't know anything about Go internals, so I'll leave it there.
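As a concrete illustration of the SO_REUSEPORT distribution mentioned above, here is a minimal Go sketch (not anyone's production setup; the address and the echo handler are placeholders): each process or shard binds the same address with SO_REUSEPORT set, and the kernel spreads incoming connections across them.

```go
package main

import (
	"context"
	"io"
	"log"
	"net"
	"syscall"

	"golang.org/x/sys/unix"
)

// reusePortListener binds addr with SO_REUSEPORT set, so several listeners
// (in this or other processes) can share the same address and the kernel
// will distribute incoming connections among them.
func reusePortListener(addr string) (net.Listener, error) {
	lc := net.ListenConfig{
		Control: func(network, address string, c syscall.RawConn) error {
			var sockErr error
			if err := c.Control(func(fd uintptr) {
				sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_REUSEPORT, 1)
			}); err != nil {
				return err
			}
			return sockErr
		},
	}
	return lc.Listen(context.Background(), "tcp", addr)
}

func main() {
	ln, err := reusePortListener("0.0.0.0:8080") // placeholder address
	if err != nil {
		log.Fatal(err)
	}
	defer ln.Close()
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go func(c net.Conn) { // placeholder handler: echo until EOF
			defer c.Close()
			io.Copy(c, c)
		}(conn)
	}
}
```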
I don't want to derail this issue, let me know if I should move this to a separate bug... We are seeing a similar issue on a system with 128 cores, and we're only reading from 96 Unix sockets, one per goroutine. Go was spending much of its time in epoll_wait. I'm looking for the profiles from the Go app; in the meantime I can share that we reproduced this issue with a simple test program. I wrote a workaround that does not invoke the netpoller. Let me know if there's anything I can do to help.
These kernel patches may be of interest:
Just to make sure I don't misread what was said: you achieved that by using raw Read syscalls, bypassing the runtime netpoller?
Correct. I'll ask today if I can share an example.
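For readers following along, here is a hypothetical reconstruction of that kind of workaround (not the actual code referenced above): keep the fd in blocking mode and never hand it to the net package, so reads block in the kernel on their own OS thread instead of going through the runtime's epoll instance.

```go
package main

import (
	"fmt"
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	// A single socketpair stands in for the 96 Unix sockets from the report.
	fds, err := unix.Socketpair(unix.AF_UNIX, unix.SOCK_STREAM, 0)
	if err != nil {
		log.Fatal(err)
	}
	// The fds stay in blocking mode and are never wrapped by the net package,
	// so the runtime never registers them with its epoll instance.

	go func() {
		if _, err := unix.Write(fds[1], []byte("hello")); err != nil {
			log.Fatal(err)
		}
	}()

	buf := make([]byte, 4096)
	n, err := unix.Read(fds[0], buf) // blocks this OS thread; no netpoller involved
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("read %d bytes: %q\n", n, buf[:n])
}
```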
I think it would be great if the Go runtime could maintain a separate epoll file descriptor (epfd) per P. Then every P could register file descriptors in its own local epfd and call epoll_wait on it independently, avoiding contention on a single shared instance.

Such a scheme may result in an imbalance of goroutines among P workers if a single goroutine creates many network connections (e.g. a server goroutine accepting all incoming connections).
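A user-space sketch of the multiple-epoll-instances idea (purely illustrative; the runtime's version would live in netpoll and integrate with Ps, and the round-robin assignment here is just the simplest possible distribution policy):

```go
package main

import (
	"log"

	"golang.org/x/sys/unix"
)

// worker owns one epoll instance and only waits on the fds assigned to it.
type worker struct {
	epfd int
}

func newWorker() (*worker, error) {
	epfd, err := unix.EpollCreate1(unix.EPOLL_CLOEXEC)
	if err != nil {
		return nil, err
	}
	return &worker{epfd: epfd}, nil
}

func (w *worker) add(fd int) error {
	ev := unix.EpollEvent{Events: unix.EPOLLIN, Fd: int32(fd)}
	return unix.EpollCtl(w.epfd, unix.EPOLL_CTL_ADD, fd, &ev)
}

func (w *worker) loop() {
	events := make([]unix.EpollEvent, 128)
	for {
		n, err := unix.EpollWait(w.epfd, events, -1)
		if err == unix.EINTR {
			continue
		}
		if err != nil {
			log.Fatal(err)
		}
		for i := 0; i < n; i++ {
			// A real implementation would wake whatever is waiting on
			// events[i].Fd; the sketch just logs the readiness.
			log.Printf("fd %d ready on epfd %d", events[i].Fd, w.epfd)
		}
	}
}

func main() {
	const nWorkers = 4
	workers := make([]*worker, nWorkers)
	for i := range workers {
		w, err := newWorker()
		if err != nil {
			log.Fatal(err)
		}
		workers[i] = w
		go w.loop()
	}

	// In a real server these fds would come from accepted connections; here a
	// few socketpairs stand in, assigned to workers round-robin.
	for i := 0; i < 8; i++ {
		fds, err := unix.Socketpair(unix.AF_UNIX, unix.SOCK_STREAM, 0)
		if err != nil {
			log.Fatal(err)
		}
		if err := workers[i%nWorkers].add(fds[0]); err != nil {
			log.Fatal(err)
		}
		go unix.Write(fds[1], []byte("ping")) // generate one readiness event
	}
	select {} // keep the sketch running
}
```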
I agree that most likely we need multiple epoll FDs, with some sort of affinity. @bwerthmann, since you're able to get perf profiles, could you get one with call stacks included? It would be really helpful if someone could create a benchmark that reproduces this issue. If it can be done with only 96 UNIX domain sockets, it may not even be especially hard.
If we want to go deep here, it might even be possible for the Go scheduler to become RX queue aware using sockopts like SO_INCOMING_CPU or SO_INCOMING_NAPI_ID.
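For reference, a small sketch of how a user program can query this information today (assuming the sockopts meant above are SO_INCOMING_CPU and SO_INCOMING_NAPI_ID; the dial target is a placeholder):

```go
package main

import (
	"fmt"
	"log"
	"net"

	"golang.org/x/sys/unix"
)

// incomingQueueInfo reports which CPU and NAPI queue the kernel says this
// connection's packets are arriving on.
func incomingQueueInfo(c *net.TCPConn) (cpu, napiID int, err error) {
	raw, err := c.SyscallConn()
	if err != nil {
		return 0, 0, err
	}
	var sockErr error
	ctrlErr := raw.Control(func(fd uintptr) {
		cpu, sockErr = unix.GetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_INCOMING_CPU)
		if sockErr != nil {
			return
		}
		napiID, sockErr = unix.GetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_INCOMING_NAPI_ID)
	})
	if ctrlErr != nil {
		return 0, 0, ctrlErr
	}
	return cpu, napiID, sockErr
}

func main() {
	conn, err := net.Dial("tcp", "example.com:80") // placeholder target
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	cpu, napi, err := incomingQueueInfo(conn.(*net.TCPConn))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("SO_INCOMING_CPU=%d SO_INCOMING_NAPI_ID=%d\n", cpu, napi)
}
```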
@aclements Profile as requested, taken with perf.
epoll contention on TCP causes latency build-up when we have high-volume ingress. This PR is an attempt to relieve that pressure. Upstream issue golang/go#65064 points to a deeper problem; we haven't yet tried the fix proposed in that issue, but this change helps without requiring a modified Go toolchain. Of course, this is a workaround for now; we're hoping for a more comprehensive fix from the Go runtime.
Okay, I found a reliable though somewhat clunky reproducer and tested it with an EPYC 7443P and an EPYC 9754, on go1.23.5 and go1.23.5+https://go.dev/cl/564197.

Reproducer: random-socket-reader

EPYC 9754 (128 cores, 256 threads):
- go1.23.5: flame graph image | profile (pprof.host | download)
- go1.23.5+564197: flame graph image | profile (pprof.host | download)

EPYC 7443P (24 cores, 48 threads):
- go1.23.5: flame graph image | profile (pprof.host | download)
- go1.23.5+564197: flame graph image | profile (pprof.host | download)

Testing process: I haven't done this in a while, so if anyone wants to double-check that I compiled go1.23.5+564197 correctly, here's the list of steps I followed:
It seems like CL 564197 does help in some way, but I'm unsure whether the CPU time spent on locks in the scheduler is expected.
@prattmic @ianlancetaylor If it would be helpful, I'm happy to give you access to the test machines. Also let me know if these pprof profiles are enough or if perf profiles going down into the kernel calls would be more helpful here.
I was able to test CL 564197 with my colleagues today on one of our prod services. Here's the result:

Before:
After: (30s) pprof.host | (90s) pprof.host

It looks like that CL fully alleviates the epoll issue. I don't know what to think about the runtime still taking up 40% of the CPU time, but there's a good chance this is just an inefficiency of our code and excessive allocations, as it looks like it's mostly GC. While it seems effective here, I'm still concerned about the potential side effect of slower network I/O. It likely won't matter for most cases, aside from highly intensive I/O workloads like the synthetic test I mentioned earlier, but I'm unsure how to confirm it won't impact Go programs more broadly.
@amwolff Thank you for the reproducer and production experiment! If you don't mind, could you share a bit more about the scale of this production workload? How many cores / what GOMAXPROCS? How many QPS is it handling (assuming that is a relevant metric)?
@prattmic Yes, of course. To explain the scale of the production workload a little better, a bit more context: one of the services we run is a program that takes inbound traffic, usually large files, processes it, and sends the processed data (expanded by a constant factor) to many smaller servers we call storage nodes. In short: an upload to a machine like the one I took a profile on results in many (100+) connections and uploads to other machines in different locations. I suspect this characteristic is why the scalability problem shows up so well there.

Profiles in my previous comment are from an EPYC 9754 (128 cores, 256 vCPUs = GOMAXPROCS) machine, which we're testing in one of our locations. We typically run a fleet of 7443Ps (24 cores) but would love to see Go scale to higher core count machines for various reasons.

Some data from the time of the experiment is in the attached graphs (network traffic, network packets, QPS, CPU, RAM); profiles were taken somewhere around the peaks. Let me know if you would like to see anything additional. For comparison, here's a profile from a 7443P machine (without CL 564197): https://pprof.host/vc40/flamegraph
Thanks @amwolff, that's great. The GOMAXPROCS=24 profile is a nice comparison as well.
Actually, most of this is the scheduler, not the GC. The GC is primarily in the background mark workers; the bulk of the remaining time is the scheduler looking for runnable work. If you are willing to share a short (~1-5s) execution trace from the application, that could better show the scheduling behavior. You can email me privately if you don't want to share it publicly.
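For anyone else who wants to capture such a trace, a minimal way to do it with runtime/trace looks like this (the file name and duration are arbitrary choices; a live service could instead hit the /debug/pprof/trace endpoint exposed by net/http/pprof):

```go
package main

import (
	"log"
	"os"
	"runtime/trace"
	"time"
)

func main() {
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	// Let the application run under load for a few seconds while tracing.
	time.Sleep(5 * time.Second)
	trace.Stop()
	// Inspect with: go tool trace trace.out
}
```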
Thanks so much @prattmic. I compiled our application with 1.24rc2+564197 and compared it with 1.23.5+564197 and 1.23.5 compilations in a brief load test. Some observations:
One phenomenon we couldn't explain during the load test is that we couldn't get past a certain CPU usage and GBit/s of traffic served; but since our testing setup is still a bit ad hoc, I realized afterward that we had started to hit some limits (like conntrack's) that would explain all of that. We need to review those, make sure they won't be limiting factors, and retest, but 1.24rc2 + CL 564197 is immediately useful to us. Thank you! I emailed you a spreadsheet of traces and profiles. For others, this is the matrix that I sent. For now, I will limit myself to just posting the 1.24rc2+564197 CPU profiles here:
Thanks, this is very useful! Here are two views from ~5ms of the traces (go1.23.5 and go1.24rc2+564197).

In the 1.23.5 trace, all the blank spaces are times when that P is in the scheduler instead of running a goroutine, presumably spending way too much time in epoll. The trace is actually very interesting because you can see each P run a clump of short-running goroutines between each gap. I think that is the P running everything in its local run queue before it needs to go look for more work. There are lots of goroutines that run for very short periods (1-5us); those are probably contributing to the epoll problem by entering the scheduler very frequently.

The go1.24rc2+564197 trace looks much better: densely packed and spending most of the time running goroutines, even though there are still very short-running goroutines around. From the go1.24rc2+564197 profiles above, I would say that 1.24 + CL 564197 completely solves the scheduling scalability issues for this application. Only 6% of time is in runtime.schedule, which is in range for a normal Go program.

There is definitely still a scalability problem in the GC around managing work buffers. I think that is most related to #21056, so I will post discussion of those symptoms there.
Will CL 564197 be merged anytime soon? Will this be backported to go1.23.x?
We will likely merge CL 564197 (or something similar) for Go 1.25. I don't expect that we will backport it to 1.23 or 1.24 [1], as this is primarily a performance improvement.

[1] I know 1.24 isn't released yet, but it is mere days away!
Okay, we will maintain the relevant backports for this locally.
For #65064

Change-Id: Ifecd7e332d2cf251750752743befeda4ed396f33
Reviewed-on: https://go-review.googlesource.com/c/go/+/564197
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Artur M. Wolff <[email protected]>
Reviewed-by: Carlos Amedee <[email protected]>
Reviewed-by: Mauri de Souza Meneguzzo <[email protected]>
Reviewed-by: Michael Pratt <[email protected]>
Split from #31908 (comment) and full write-up at https://jazco.dev/2024/01/10/golang-and-epoll/.
tl;dr is that a program on a 192-core machine with >2500 sockets and with >1k becoming ready at once results in huge costs in netpoll -> epoll_wait (~65% of total CPU). Most interesting is that sharding these connections across 8 processes seems to solve the problem, implying some kind of super-linear scaling.
Given that the profile shows the time spent in epoll_wait itself, this may be a scalability problem in the kernel itself, but we may still be able to mitigate it.

@ericvolp12, some questions if you don't mind answering: do you have a perf profile of this problem that shows where the time in the kernel is spent?

cc @golang/runtime