Description
We measured the impact of GC relative to the number of persistent heap objects.
Our existing interactive accounting program keeps 44K to 90K persistent heap objects.
Our next multiplatform distributed VR/AR program will use at least 1000K (1M) persistent heap objects.
We tested by adding an extra function, BurnCPU(), running in the background, which allocates small data structures and keeps less than 1% of the created structures persistent.
Environment
- OS = Windows 10
- go version = 1.12.5 windows/amd64
Expected
- "BurnCPU" routine running in te background would slow-down the entire system (both the program that contains it and other programs running on windows) by the excerted stress on memory and CPU.
- Added function does not share memory or other resources with program to which it was addes and due should have little affect on it (other than slowing the entire system down).
- "BurnCPU" will not block normal usage of the interactive application that contains it.
- BurnCPU will not be significantly slower than a native C equivalent.
Observed (after adding "go BurnCPU()" as the first line of the main() function)
- Creation of the interactive GUI window is delayed by 30 to 200 seconds.
- Normal usage of the GUI is only possible after BurnCPU has run to completion.
- The multi-threaded version was slower than the single-threaded one (Core i7 at 4 GHz with 16 GB RAM).
- Usage of channels degraded performance instead of improving it.
- Both buffered and unbuffered channels were unusable due to the slowdown they introduced.
- The map became extremely slow once it held many items (using a simple integer key with a simple pointer as value); see the comparison sketch after this list.
- Locking between goroutines was extreme, even after tweaking co-operative multitasking with runtime.Gosched().
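To illustrate the map observation, here is a minimal, self-contained sketch (not our proprietary benchmark; the key count and struct are illustrative assumptions) that contrasts inserting into a map[int]*T against a fixed slice indexed by key modulo its length:

package main

import (
    "fmt"
    "time"
)

type aha struct{ a, b int }

func main() {
    const n = 1 << 22 // about 4M keys; an illustrative assumption, not our real workload

    // Insert into a map keyed by int, as BurnCPU does.
    m := make(map[int]*aha)
    start := time.Now()
    for i := 0; i < n; i++ {
        m[i] = &aha{a: i, b: i}
    }
    fmt.Println("map insert:  ", time.Since(start), len(m))

    // Same inserts into a fixed slice with a modulus index simulating the map.
    s := make([]*aha, n)
    start = time.Now()
    for i := 0; i < n; i++ {
        s[i%n] = &aha{a: i, b: i}
    }
    fmt.Println("slice insert:", time.Since(start), len(s))
}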
Benchmark code:
- was added to proprietary code (which cannot be listed here) based on the lxn/walk GUI library for Windows
- the effect can be reproduced with the publicly available example at https://github.com/lxn/walk/blob/master/examples/filebrowser/filebrowser.go by adding "go BurnCPU()" as the first line of its main() function
code:
package main

import (
    "runtime"
    "sync"
)

func main() {
    go BurnCPU()
    // <- code of the existing GUI application remains here
}

func BurnCPU() {
    type aha struct{ a, b int }
    oho := map[int]*aha{}       // persistent objects accumulate here
    com := make(chan *aha, 100) // buffered channel feeding the map writer
    wg := sync.WaitGroup{}

    // Single consumer: receives allocated objects and keeps them
    // persistent in the map (this goroutine never exits).
    go func() {
        for {
            c := <-com
            oho[c.a+c.b] = c
        }
    }()

    // Many producers: each allocates small structures in a tight loop and
    // keeps roughly 1 in 1000 of them persistent via the channel (<1%).
    for a := 1; a < 1e5; a++ {
        wg.Add(1)
        go func(a int) {
            defer wg.Done()
            for b := 1; b < 1e5; b++ {
                c := new(aha) // allocate some memory to cause GC
                c.a = 10 * a
                c.b = 10 * b
                if b%1000 == 0 {
                    com <- c
                    runtime.Gosched()
                }
            }
        }(a)
        runtime.Gosched()
    }
    wg.Wait()

    memStats := runtime.MemStats{}
    runtime.ReadMemStats(&memStats)
    return // <- set a debug breakpoint here and examine memStats (or print it; see the sketch below)
}
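For readers without a debugger attached, the figures we compared can also be printed at that point. A minimal sketch (the helper name reportGC is ours; the fields are standard runtime.MemStats fields):

package main

import (
    "fmt"
    "runtime"
    "time"
)

func main() {
    // ... run the workload of interest here, then:
    reportGC()
}

// reportGC prints the MemStats fields referred to in the conclusions below.
func reportGC() {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)
    fmt.Printf("GC cycles:            %d\n", m.NumGC)
    fmt.Printf("cumulative STW pause: %v\n", time.Duration(m.PauseTotalNs))
    fmt.Printf("GC CPU fraction:      %.2f%%\n", m.GCCPUFraction*100)
    fmt.Printf("heap objects:         %d\n", m.HeapObjects)
}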
Conclusions:
- Co-operative multitasking tweaked with "runtime.Gosched()" is something I have not had to bother with since Windows 3.1; now it is back!
- Reading from and writing to channels (according to the documentation) yields to the scheduler, so adding "runtime.Gosched()" at that point should have no effect. It clearly does have an effect, so the documentation is wrong on this point: doing channel I/O is not sufficient to avoid adding "runtime.Gosched()" in loops.
- Launching goroutines had to be preceded by "runtime.Gosched()" to make the program run somewhat smoothly.
- Adding channels to simulate lock-free concurrent processing per object degraded performance to the point of "entire project cancelled"; see the buffered/unbuffered comparison sketch after this list.
- The observed MemStats values do not match each other: the percentage of CPU spent on GC indicated 73% over 200 seconds, while the cumulative duration of GC stops showed only 0.47 seconds.
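A minimal sketch of the buffered-versus-unbuffered comparison referenced above (assumptions: a single producer and single consumer, an item count chosen for illustration, and a consumer that does no real work):

package main

import (
    "fmt"
    "time"
)

// timeChannel pushes n integers through a channel with the given buffer size
// and returns the elapsed wall-clock time.
func timeChannel(bufSize, n int) time.Duration {
    ch := make(chan int, bufSize)
    done := make(chan struct{})
    start := time.Now()
    go func() {
        for v := range ch {
            _ = v // consumer does no real work in this sketch
        }
        close(done)
    }()
    for i := 0; i < n; i++ {
        ch <- i
    }
    close(ch)
    <-done
    return time.Since(start)
}

func main() {
    const n = 1000000 // assumption: enough items to make the difference visible
    for _, buf := range []int{0, 10, 1000} {
        fmt.Printf("buffer %4d: %v\n", buf, timeChannel(buf, n))
    }
}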
Variants:
- We tested various combinations of outer and inner loop iteration counts (the example shows 1e5 × 1e5; changing this to 1e3 × 1e7, 1e4 × 1e6, or 1e7 × 1e3 produced even stranger effects, such as immediate abnormal program termination without an error message).
- Comparison with native C code (the BurnCPU function alone, without adding it to the existing interactive program) showed remarkable figures: in the best case Go was 373x slower, in the worst case 2847x slower. (The native C alternative cannot be listed here because it uses proprietary templates that simulate channels with a circular buffer whose read and write indices are advanced via interlocked increments; a rough Go approximation of that idea is sketched below.)
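Since the C code cannot be shown, here is a rough Go approximation of the idea it is built on: a single-producer/single-consumer ring buffer whose positions are advanced with atomic operations instead of channel locking. All names and sizes are illustrative assumptions, not the code we actually benchmarked:

package main

import (
    "runtime"
    "sync/atomic"
)

// spscRing is a single-producer/single-consumer ring buffer; the read and
// write positions are advanced with atomic operations rather than locks.
type spscRing struct {
    buf   []interface{}
    mask  uint64
    read  uint64 // next slot to read  (advanced only by the consumer)
    write uint64 // next slot to write (advanced only by the producer)
}

func newSPSCRing(sizePow2 uint64) *spscRing {
    return &spscRing{buf: make([]interface{}, sizePow2), mask: sizePow2 - 1}
}

// Put spins until a slot is free, then publishes v. Producer side only.
func (r *spscRing) Put(v interface{}) {
    w := atomic.LoadUint64(&r.write)
    for w-atomic.LoadUint64(&r.read) == uint64(len(r.buf)) {
        runtime.Gosched() // buffer full, yield and retry
    }
    r.buf[w&r.mask] = v
    atomic.StoreUint64(&r.write, w+1) // publish the slot
}

// Get spins until an item is available, then returns it. Consumer side only.
func (r *spscRing) Get() interface{} {
    rd := atomic.LoadUint64(&r.read)
    for atomic.LoadUint64(&r.write) == rd {
        runtime.Gosched() // buffer empty, yield and retry
    }
    v := r.buf[rd&r.mask]
    atomic.StoreUint64(&r.read, rd+1) // free the slot
    return v
}

func main() {
    r := newSPSCRing(1024) // size must be a power of two
    done := make(chan struct{})
    go func() {
        for i := 0; i < 1000000; i++ {
            _ = r.Get()
        }
        close(done)
    }()
    for i := 0; i < 1000000; i++ {
        r.Put(i)
    }
    <-done
}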
Assumptions based on these observations:
- Go in its current state of development seems suited only for interactive applications or applications that have idle time; heavy load combined with a large persistent heap of objects does not perform acceptably in Go at this moment.
- Usage of channels is documented as the native Go way to avoid deadlocks and locking, yet it seems to introduce extreme slowdown due to locking (we tested both buffered and unbuffered variants, with buffer sizes of 0, 10 and 1000 items).
- Using a map, relative to a fixed array with a modulus index simulating the map, caused serious performance degradation in excess of 18x (unexpected, since the map involves no locking here).
- Go was selected as a candidate for shared code in multi-platform development (Windows, macOS, iOS, Android) with a native GUI front-end (Go, Swift, Kotlin) per platform. However, the shared code needs to hold 1M..100M in-memory data structures describing 3D animated shapes. It seems Go is not up to that task, while Rust (the second candidate) is also not yet mature enough for commercial use, so we are back to native C++ for this purpose.
- The documented advances in reducing the GC "stop the world" delay are "misleading". The advance comes from the way the stops are measured (arranging GC work so that it intersects with the application's idle time). Functions that do not actively cooperate with this logic (by calling "runtime.Gosched()" all the time) seem to block the start of the GC cleanup while blocking all other goroutines. So this documented way of measuring GC performance is misleading, to put it politely; a cross-check is sketched below.
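One way to cross-check the pause accounting independently of a debugger is the runtime's own GC trace combined with runtime/debug.ReadGCStats. A minimal sketch (GODEBUG=gctrace=1 and ReadGCStats are standard runtime facilities; the workload call is a placeholder):

package main

// Run with the environment variable GODEBUG=gctrace=1 set so the runtime
// prints one line per completed GC cycle with its clock and CPU times;
// those per-cycle figures can be compared with the totals printed below.

import (
    "fmt"
    "runtime/debug"
)

func main() {
    // ... placeholder: run the workload under investigation here ...

    var stats debug.GCStats
    debug.ReadGCStats(&stats)
    fmt.Println("GC cycles:   ", stats.NumGC)
    fmt.Println("total pause: ", stats.PauseTotal)
    if len(stats.Pause) > 0 {
        fmt.Println("last pause:  ", stats.Pause[0]) // most recent pause first
    }
}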