runtime: bad performance of core GO features (GC, go routines, channels, map) #32195
Just want to point out that "abominable" is not a good way to start a discussion about performance. Constructive issues get attention and often positive outcomes, but this doesn't seem constructive to me. Also, it would be much better to raise issues about specific performance problems you've encountered, and reliable ways to reproduce them. This issue is quite massive, so it would be hard to stay on topic and tackle any specific problems separately.
I have done some deeper digging and (although still speculative) might have found a reason for the bad performance that cannot be found in software. It might be a pure hardware problem. Go claims to have no problems with numbers of goroutines that far exceed the number of available CPU cores. To a degree that is accurate, since goroutines are distributed over a fixed number of worker threads (about as many as there are CPU cores). HOWEVER... there is still an excessive amount of context switching within such a worker thread, and each time such a switch occurs, the data in the L1/L2/L3 cache of the CPU core running that worker thread becomes useless. So the excessive amount of context switching prevents the CPU cache from doing its normal job. With normal OS round-robin scheduling, the switches between processes are determined by the OS and are far less frequent (in the order of e.g. 50x per second), so the cache is only invalidated every 20 ms or so. Go has broken this rule and can have (depending on the application) 1000x more context switches per worker thread per second. I have tried to test this hypothesis by adding...
... to the loop that launches the goroutines. So BurnCPU() now looks like this:
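(The snippet itself is missing from the thread. Based on the later mention of runtime.NumGoroutine(), a plausible reconstruction of the throttled version might look like the following; the yield mechanism, the work count, and the burn function are all assumptions, not the original code:)

```go
package burner

import (
	"runtime"
	"sync"
)

// Hypothetical reconstruction -- the original BurnCPU was not preserved.
func BurnCPU(burn func()) {
	var wg sync.WaitGroup
	for i := 0; i < 10000; i++ {
		// Throttle: before spawning another goroutine, wait until the
		// number of live goroutines drops to the number of CPU cores.
		for runtime.NumGoroutine() > runtime.NumCPU() {
			runtime.Gosched() // yield instead of busy-spinning
		}
		wg.Add(1)
		go func() {
			defer wg.Done()
			burn() // the CPU-heavy work from the benchmark
		}()
	}
	wg.Wait()
}
```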
The results now look better and the blocking of the interactive part of the program is gone. The ideal solution would be for this kind of logic to be part of the "go" statement itself. So the basic idea is that any "go" would block the routine that executes it until the number of already-active routines drops below the number of available worker threads. How this influences existing code and benchmarks would have to be investigated first, of course. Another alternative could be to add a function "WaitGroup.Go(func)" that does this check based only on the number of routines in the WaitGroup. Then the WaitGroup.Add(1) is done by WaitGroup.Go(func) and you would only have to call WaitGroup.Done() at the end of the started goroutine.
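(A minimal sketch of what such a helper could look like today, outside the standard library. The type name Group, the slot limit, and the channel-based semaphore are assumptions made for illustration, not part of the proposal:)

```go
package group

import "sync"

// Group is a hypothetical WaitGroup wrapper implementing the proposed
// Go(func) semantics: Go blocks until a concurrency slot is free.
type Group struct {
	wg  sync.WaitGroup
	sem chan struct{} // capacity = max concurrently running goroutines
}

func NewGroup(limit int) *Group {
	return &Group{sem: make(chan struct{}, limit)}
}

func (g *Group) Go(f func()) {
	g.sem <- struct{}{} // blocks while `limit` goroutines are running
	g.wg.Add(1)
	go func() {
		defer func() {
			<-g.sem // release the slot even if f panics
			g.wg.Done()
		}()
		f()
	}()
}

func (g *Group) Wait() { g.wg.Wait() }
```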
The implementation of your BurnCPU function doesn't make sense to me. Why are you spawning 10000 goroutines? In a hypothetical C implementation of BurnCPU, would you spawn 10000 OS threads? No, obviously... that would just completely trash the OS scheduler without actually helping you burn CPU in any way. But in your Go code, you're doing exactly that.
This is only true when the goroutines are spawned to execute lightweight tasks, or tasks that block. In every other case it makes no sense to spawn more goroutines than you have cores on your machine, and experienced Go developers are well aware of this. The number of goroutines tasked with executing heavy workloads (like "burning" CPU in your test) needs to be limited to the number of cores on your machine, if only because you cannot do more CPU-heavy concurrent work than you have cores anyway. Your BurnCPU function needs to be changed so that it only spawns a number of goroutines exactly equal to the number of threads that can execute concurrently on your machine (e.g. 4 or 8). Otherwise you're just trashing the goroutine scheduler for no reason, the goroutines in charge of the GUI get trapped in the 10000-goroutine scheduler-hell you created, and the UI won't be responsive. In general, from the code and the message you posted, it appears that you have a very superficial understanding of how Go works and, more importantly, of how Go primitives like goroutines should be used. I encourage you to gain more experience writing idiomatic, high-performance Go code before you trust yourself with an opinion about whether the language is up to a specific task you have to accomplish. I'm not saying Go is definitely the right choice for your application: maybe it's not.
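(A minimal sketch of the bounded version being described, assuming the work can be split into equal shares; the burn function and the total of 10000 work units are placeholders taken from the discussion:)

```go
package burner

import (
	"runtime"
	"sync"
)

// BurnCPU spawns exactly one goroutine per CPU core and divides the
// work units among them, instead of one goroutine per work unit.
func BurnCPU(burn func()) {
	workers := runtime.NumCPU()
	const totalUnits = 10000

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each worker handles its share (remainder ignored for brevity).
			for i := 0; i < totalUnits/workers; i++ {
				burn()
			}
		}()
	}
	wg.Wait()
}
```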
Preemptive goroutine scheduling is in progress, probably landing in 1.14, see #24543. Also, channels are not the fastest possible way to synchronize goroutines. The sync and sync/atomic packages are more efficient, if harder to get right. "Sharing memory by communicating" only takes you so far :-) And if I may, I'd encourage everyone to be more gentle in tone.
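(To illustrate the sync/atomic point, here is a hedged sketch of the same shared counter written both ways; the goroutine and increment counts are arbitrary, and this is not code from the thread:)

```go
package counter

import (
	"sync"
	"sync/atomic"
)

// Channel version: every increment is a send, serialized by the runtime.
func countWithChannel(goroutines, increments int) int64 {
	ch := make(chan int64)
	done := make(chan int64)
	go func() {
		var total int64
		for v := range ch {
			total += v
		}
		done <- total
	}()
	var wg sync.WaitGroup
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < increments; i++ {
				ch <- 1
			}
		}()
	}
	wg.Wait()
	close(ch)
	return <-done
}

// Atomic version: a single shared word, no scheduler involvement per add.
func countWithAtomic(goroutines, increments int) int64 {
	var total int64
	var wg sync.WaitGroup
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < increments; i++ {
				atomic.AddInt64(&total, 1)
			}
		}()
	}
	wg.Wait()
	return total
}
```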
The constructed benchmark is equivalent in scale and load to the target application, which maintains a distributed AR scene with a persistent shape count in the range of 1 to 100 million objects, each having its own local methods that can use goroutines. The abilities of C++, both on CPU and GPU, are in that performance range.
It's hard to make sensible comments about the kind of high-level architecture an application of this level of complexity should have, but as a general observation: if those millions of objects could potentially fire up CPU-intensive tasks, each in its own goroutine, concurrently, I don't think this kind of architecture can or will work at all, for the reasons I have stated above. So basically it appears that the main issue you have here is that your Go CPUBurner prototype is modelling an architecture that is fundamentally not suitable for the kind of application you'd like to build. My impression is that the fact that Go provides certain features (like goroutines and channels) at a language level has misled you into producing a design that is not suitable for your needs, and more importantly:
the design of your Go prototype is fundamentally different from the one you would produce if you had to code this in C/C++. I'm sure that if you were to build this in C++ you wouldn't spawn a thread of computation each time one of the millions of live objects needs to do something, and you wouldn't expect the application to work while thousands of these threads are all executing at a given moment. While it's true that spawning goroutines to perform atomic tasks and communicating using channels is often the idiomatic way to write Go code, this is not always true. Sometimes the kind of application you need to build forces you to design an architecture that comes closer to the one you'd design when writing C/C++ code.
I'm not surprised that your GUI response is terrible. You have 10000 goroutines, and only one of them is the GUI goroutine. The runtime has no way to know which of those 10000 is your latency-critical one. The scheduler tries to be at least somewhat fair to all the goroutines, which means your GUI one is only getting 1/10000 of the machine. A fix for this kind of application would require goroutine priorities. We don't have that at the moment, but you can simulate it with…
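(The suggestion was cut off here and the original snippet is not recoverable. One common way to approximate priorities, offered as an assumption rather than the commenter's actual suggestion, is to leave one core's worth of headroom for the latency-critical goroutine:)

```go
package workers

import "runtime"

// startWorkers caps the CPU-burning goroutines at NumCPU()-1 so one
// core stays mostly free for a latency-critical (e.g. GUI) goroutine.
// This does not implement real priorities; it only leaves headroom.
func startWorkers(burn func(), jobs int) {
	limit := runtime.NumCPU() - 1
	if limit < 1 {
		limit = 1
	}
	sem := make(chan struct{}, limit)
	for i := 0; i < jobs; i++ {
		sem <- struct{}{} // blocks while `limit` workers are running
		go func() {
			defer func() { <-sem }()
			burn()
		}()
	}
	// Drain the semaphore to wait for the remaining workers.
	for i := 0; i < limit; i++ {
		sem <- struct{}{}
	}
}
```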
You're not reading and writing channels in your loop. The only operation in the loop is a waitgroup add and a goroutine spawn. The operations inside the goroutine don't count. Nevertheless, your use of …
I'm not sure where you are getting that from.
The GC is concurrent with program execution. The GC stops are very short and most of the GC work is done while the application is also running. That may explain the numbers you're seeing.
Maps are definitely slower than arrays. 18x does not sound unreasonable. Does that mean maps are really slow, or arrays are really fast? Array access can be as little as one instruction. Map access is at least a function call, and is hundreds of instructions even on the fast path.
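(A hedged sketch of how one could measure this with the stdlib testing package; the table size and access pattern are arbitrary choices, and the 18x figure above is the reporter's, not the output of this code:)

```go
package bench

import "testing"

const n = 1 << 16

var (
	arr  [n]int
	m    = make(map[int]int, n)
	sink int // prevents the compiler from eliminating the loads
)

func init() {
	for i := 0; i < n; i++ {
		m[i] = i
	}
}

func BenchmarkArrayAccess(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = arr[i&(n-1)] // a masked index: effectively one load
	}
}

func BenchmarkMapAccess(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = m[i&(n-1)] // hashing plus a function call, even on the fast path
	}
}
```

Run with `go test -bench=.` to compare the two per-operation timings directly.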
I'm not sure how this relates to the benchmark provided. How are you determining that GC cleanup is blocking all other goroutines? It shouldn't, unless there's a bug somewhere. If there's a bug, we'd need to understand how you're reaching that conclusion. Ideally, with a self-contained example that we can run.
Something like this would be really nice. We'd love to be able to stop the "spawning goroutine" when we have enough ready goroutines to saturate the CPUs, and only resume the spawning goroutine when some of the spawned goroutines block or finish. Unfortunately, there's no easy way to look into the future behavior of goroutines; maybe the goroutine that was the spawning goroutine is actually done spawning and will next do something latency-sensitive. There might be heuristics we could apply, but it's tricky.
Agreed, but the same applies to "runtime.NumGoroutine()", since the number of goroutines can increase or decrease between the moment we test it and the moment we launch our own goroutine. Without the throttling logic, e.g. ...
... the amount of memory used for the same logic fluctuates wildly. With ...
...the amount of used memory (private allocated virtual pages) is 12 GB, while with...
...the amount of used memory (private allocated virtual pages) is 260 MB. That is while, in theory, the amount of CPU or memory used should not change.
Finished the optimisation by making my own version of the goroutine launching logic.
Throttle.go
Barrier.go
What is the key to the latter code's success? Using the Windows API for scheduling? Have you used the stdlib testing.B benchmark feature? Could it apply here? I'd find it easier to evaluate the slow and fast samples if they were compared one after the other. Also, it would help to benchmark just the scheduling mechanism, as any workload to be scheduled takes the same time in either case. Why do you need func doit (vs …
Avoiding the situation where the L1/L2/L3 cache of the processor becomes useless due to an extreme number of context switches per second. The main logic of the solution is that the start of the next goroutine is delayed until one of the processor cores becomes available. So if you need to start 1000 goroutines and have an 8-core processor, the first 8 start immediately, the 9th must wait until one of the first 8 is done, and from then on, every time any of the goroutines finishes, the next one starts.
Barrier.go only exists because neither channels nor any of the sync objects worked for this purpose.
No, I used a stopwatch for timing and the Windows task manager for looking at the amount of private allocated virtual memory.
Yes, but if you have severe software timing troubles, measuring a time duration with software is not guaranteed to be accurate.
Yes, deeper investigation is required.
The defer is needed to make sure that the interlocked increment of the "done" counter also happens in case there was an error during execution of the passed doit() action. That is cleanup code that must always be executed, and hence there is one extra level of call on the stack.
Did you try using a buffered channel as a semaphore, like:
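(The snippet is missing from this thread. A minimal sketch of the buffered-channel semaphore presumably being suggested, with the capacity chosen as an assumption:)

```go
package throttle

import "runtime"

// Hypothetical reconstruction of the missing snippet: a buffered
// channel used as a counting semaphore.
var sem = make(chan struct{}, runtime.NumCPU())

func acquire() { sem <- struct{}{} } // blocks when all slots are taken
func release() { <-sem }
```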
Then let each goroutine get/put the channel on entry/exit. That will limit the number of running goroutines.
So, basically, you implemented a goroutine pool. There are countless libraries that let you do exactly that. As was said earlier, the scheduler tries to be fair and gives execution time to every goroutine. Just as you wouldn't start 1000 CPU-intensive OS threads in, say, C++, you shouldn't start 1000 CPU-intensive goroutines; you would use a thread/goroutine pool. Go is not doing any magic here. The general rule is pretty simple: for IO-intensive work, where you mostly wait on IO, you can spawn thousands of goroutines; for CPU-intensive work, you spawn as many goroutines as you have CPU cores/threads.
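(For completeness, a minimal sketch of the classic goroutine-pool pattern described above; the job representation and worker count are illustrative choices, not from the thread:)

```go
package pool

import (
	"runtime"
	"sync"
)

// runPool feeds jobs to a fixed set of workers over a channel,
// instead of spawning one goroutine per job.
func runPool(jobs []func()) {
	work := make(chan func())
	var wg sync.WaitGroup

	// One worker per CPU core; each drains jobs until the channel closes.
	for w := 0; w < runtime.NumCPU(); w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for job := range work {
				job()
			}
		}()
	}

	for _, job := range jobs {
		work <- job
	}
	close(work)
	wg.Wait()
}
```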
Really embarrassing, the lack of insight in the responses here (including the number of thumbs up for imaginary scenarios that didn't work).
What's there to try? Your case is a pretty simple one - a bunch of CPU work. The solution is also pretty simple and well known - a thread pool. Unless your code is doing something else you didn't mention, it's just a thread pool. You said it yourself.
@StephanVerbeeck once again, please keep it civil. The people in this thread are trying to figure out what's going on and help. |
I regularly use buffered channels as semaphores; they work fine. If you can show that a buffered channel is vastly less efficient than your barrier implementation using the WinAPI, benchmark both with code that only exercises Open/Wait() or channel get/put, then post your benchmark code to a new issue proposing a new sync type. Also, you might want to look around for third-party semaphore implementations; there's probably a handful doing native API calls like yours.
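(A hedged sketch of what the channel side of such a micro-benchmark could look like; the barrier side would wrap Open/Wait() the same way, with the actual Barrier.go API substituted in:)

```go
package bench

import "testing"

// BenchmarkChannelSemaphore times only the acquire/release pair on a
// buffered channel, with goroutines contending via RunParallel.
func BenchmarkChannelSemaphore(b *testing.B) {
	sem := make(chan struct{}, 1)
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			sem <- struct{}{} // acquire
			<-sem             // release
		}
	})
}
```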
This thread is about goroutines not doing as advertised.
@StephanVerbeeck no one is attacking or insulting you. If anything, in this thread you are the only one being aggressive. I really don't like temporarily locking heated threads, but that will be the only option left if you don't follow https://golang.org/conduct. In particular, there's no need to write in all caps, give instructions with exclamation marks, or refer to others as "all you smart people".
To run a lot of CPU-intensive tasks on a few cores, you can either try to juggle them all, at the cost of context switching, or prioritize them, at the cost of importing or writing scheduling logic. The Go runtime cannot read your mind and know that you need prioritization; after all, there are endless ways to prioritize things. I regret that you got a different impression (although I'm not to blame for any Go docs :-) But it's not hard to write (or find on GitHub) cross-platform scheduling logic in Go. @mvdan, I think you can close this issue now.
Possibly related: #32398
We measured the impact of GC relative to the number of persistent memory objects.
An existing interactive accounting program used 44K to 90K persistent heap objects.
The next multiplatform distributed VR/AR program will use at least 1000K persistent heap objects.
Tested by adding an extra function BurnCPU() in the background that allocates small data structures and keeps less than 1% of the created structures persistent.
Environment
Expected
Observed (After adding "go BurnCPU()" as first line of the main() function)
Benchmark code:
code:
Conclusions:
Variants:
Assumptions based on these observations: