Commit 1b4025f

runtime: replace per-M workbuf cache with per-P gcWork cache
Currently, each M has a cache of the most recently used *workbuf. This is used primarily by the write barrier so it doesn't have to access the global workbuf lists on every write barrier. It's also used by stack scanning because it's convenient.

This cache is important for write barrier performance, but this particular approach has several downsides. It's faster than no cache, but far from optimal (as the benchmarks below show). It's complex: access to the cache is sprinkled through most of the workbuf list operations, and it requires special care to transform into and back out of the gcWork cache that's actually used for scanning and marking. It requires atomic exchanges to take ownership of the cached workbuf and to return it to the M's cache, even though it's almost always used only by the current M. Since it's per-M, flushing these caches is O(# of Ms), which may be high. And it has some significant subtleties: for example, in general the cache shouldn't be used after the harvestwbufs() in mark termination because it could hide work from mark termination, but stack scanning can happen after this and *will* use the cache (though this turns out to be okay because it is always followed by a getfull(), which drains the cache).

This change replaces this cache with a per-P gcWork object. This gcWork cache can be used directly by scanning and marking (as long as preemption is disabled, which is a general requirement of gcWork). Since it's per-P, it doesn't require synchronization, which simplifies things and means the only atomic operations in the write barrier are occasionally fetching new work buffers and setting a mark bit if the object isn't already marked. This cache can be flushed in O(# of Ps), which is generally small. It follows a simple flushing rule: the cache can be used during any phase, but during mark termination it must be flushed before allowing preemption.
This also makes the dispose during mutator assist no longer necessary, which eliminates the vast majority of gcWork dispose calls and reduces contention on the global workbuf lists. And it's a lot faster on some benchmarks:

benchmark                        old ns/op    new ns/op    delta
BenchmarkBinaryTree17            11963668673  11206112763  -6.33%
BenchmarkFannkuch11              2643217136   2649182499   +0.23%
BenchmarkFmtFprintfEmpty         70.4         70.2         -0.28%
BenchmarkFmtFprintfString        364          307          -15.66%
BenchmarkFmtFprintfInt           317          282          -11.04%
BenchmarkFmtFprintfIntInt        512          483          -5.66%
BenchmarkFmtFprintfPrefixedInt   404          380          -5.94%
BenchmarkFmtFprintfFloat         521          479          -8.06%
BenchmarkFmtManyArgs             2164         1894         -12.48%
BenchmarkGobDecode               30366146     22429593     -26.14%
BenchmarkGobEncode               29867472     26663152     -10.73%
BenchmarkGzip                    391236616    396779490    +1.42%
BenchmarkGunzip                  96639491     96297024     -0.35%
BenchmarkHTTPClientServer        100110       70763        -29.31%
BenchmarkJSONEncode              51866051     5251138      +1.24%
BenchmarkJSONDecode              103813138    86094963     -17.07%
BenchmarkMandelbrot200           4121834      4120886      -0.02%
BenchmarkGoParse                 16472789     5879949      -64.31%
BenchmarkRegexpMatchEasy0_32     140          140          +0.00%
BenchmarkRegexpMatchEasy0_1K     394          394          +0.00%
BenchmarkRegexpMatchEasy1_32     120          120          +0.00%
BenchmarkRegexpMatchEasy1_1K     621          614          -1.13%
BenchmarkRegexpMatchMedium_32    209          202          -3.35%
BenchmarkRegexpMatchMedium_1K    54889        55175        +0.52%
BenchmarkRegexpMatchHard_32      2682         2675         -0.26%
BenchmarkRegexpMatchHard_1K      79383        79524        +0.18%
BenchmarkRevcomp                 584116718    584595320    +0.08%
BenchmarkTemplate                125400565    109620196    -12.58%
BenchmarkTimeParse               386          387          +0.26%
BenchmarkTimeFormat              580          447          -22.93%

(Best out of 10 runs. The delta of averages is similar.)

This also puts us in a good position to flush these caches when nearing the end of concurrent marking, which will let us increase the size of the work buffers while still controlling mark termination pause time.
Change-Id: I2dd94c8517a19297a98ec280203cccaa58792522
Reviewed-on: https://go-review.googlesource.com/9178
Run-TryBot: Austin Clements <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Russ Cox <[email protected]>
1 parent d1cae63 commit 1b4025f

File tree

4 files changed: +58 -162 lines changed

src/runtime/mgc.go

+20-6
@@ -501,7 +501,7 @@ func (c *gcControllerState) findRunnable(_p_ *p) *g {
 		// else for a while, so kick everything out of its run
 		// queue.
 	} else {
-		if _p_.m.ptr().currentwbuf == 0 && work.full == 0 && work.partial == 0 {
+		if _p_.gcw.wbuf == 0 && work.full == 0 && work.partial == 0 {
 			// No work to be done right now. This can
 			// happen at the end of the mark phase when
 			// there are still assists tapering off. Don't
@@ -1026,7 +1026,6 @@ func gcBgMarkWorker(p *p) {
 	// is set, this puts itself into _Gwaiting to be woken up by
 	// gcController.findRunnable at the appropriate time.
 	notewakeup(&work.bgMarkReady)
-	var gcw gcWork
 	for {
 		// Go to sleep until woken by gcContoller.findRunnable.
 		// We can't releasem yet since even the call to gopark
@@ -1055,18 +1054,19 @@ func gcBgMarkWorker(p *p) {
 		done := false
 		switch p.gcMarkWorkerMode {
 		case gcMarkWorkerDedicatedMode:
-			gcDrain(&gcw, gcBgCreditSlack)
+			gcDrain(&p.gcw, gcBgCreditSlack)
 			// gcDrain did the xadd(&work.nwait +1) to
 			// match the decrement above. It only returns
 			// at a mark completion point.
 			done = true
 		case gcMarkWorkerFractionalMode, gcMarkWorkerIdleMode:
-			gcDrainUntilPreempt(&gcw, gcBgCreditSlack)
+			gcDrainUntilPreempt(&p.gcw, gcBgCreditSlack)
 			// Was this the last worker and did we run out
 			// of work?
 			done = xadd(&work.nwait, +1) == work.nproc && work.full == 0 && work.partial == 0
 		}
-		gcw.dispose()
+		// We're not in mark termination, so there's no need
+		// to dispose p.gcw.

 		// If this worker reached a background mark completion
 		// point, signal the main GC goroutine.
@@ -1121,6 +1121,14 @@ func gcMark(start_time int64) {

 	gcCopySpans() // TODO(rlh): should this be hoisted and done only once? Right now it is done for normal marking and also for checkmarking.

+	// Gather all cached GC work. All other Ps are stopped, so
+	// it's safe to manipulate their GC work caches. During mark
+	// termination, these caches can still be used temporarily,
+	// but must be disposed to the global lists immediately.
+	for i := 0; i < int(gomaxprocs); i++ {
+		allp[i].gcw.dispose()
+	}
+
 	work.nwait = 0
 	work.ndone = 0
 	work.nproc = uint32(gcprocs())
@@ -1135,9 +1143,9 @@ func gcMark(start_time int64) {
 		helpgc(int32(work.nproc))
 	}

-	harvestwbufs() // move local workbufs onto global queues where the GC can find them
 	gchelperstart()
 	parfordo(work.markfor)
+
 	var gcw gcWork
 	gcDrain(&gcw, -1)
 	gcw.dispose()
@@ -1153,6 +1161,12 @@ func gcMark(start_time int64) {
 		notesleep(&work.alldone)
 	}

+	for i := 0; i < int(gomaxprocs); i++ {
+		if allp[i].gcw.wbuf != 0 {
+			throw("P has cached GC work at end of mark termination")
+		}
+	}
+
 	if trace.enabled {
 		traceGCScanDone()
 	}

src/runtime/mgcmark.go

+21-39
@@ -55,6 +55,7 @@ var oneptr = [...]uint8{typePointer}

 //go:nowritebarrier
 func markroot(desc *parfor, i uint32) {
+	// TODO: Consider using getg().m.p.ptr().gcw.
 	var gcw gcWork

 	// Note: if you add a case here, please also update heapdump.go:dumproots.
@@ -172,6 +173,8 @@ func markroot(desc *parfor, i uint32) {
 // allocations performed by this mutator since the last assist.
 //
 // It should only be called during gcphase == _GCmark.
+//
+// This must be called with preemption disabled.
 //go:nowritebarrier
 func gcAssistAlloc(size uintptr, allowAssist bool) {
 	// Find the G responsible for this assist.
@@ -228,19 +231,14 @@ func gcAssistAlloc(size uintptr, allowAssist bool) {

 		xadd(&work.nwait, -1)

-		// drain own current wbuf first in the hopes that it
+		// drain own cached work first in the hopes that it
 		// will be more cache friendly.
-		var gcw gcWork
-		gcw.initFromCache()
+		gcw := &getg().m.p.ptr().gcw
 		startScanWork := gcw.scanWork
-		gcDrainN(&gcw, scanWork)
+		gcDrainN(gcw, scanWork)
 		// Record that we did this much scan work.
 		gp.gcscanwork += gcw.scanWork - startScanWork
-		// TODO(austin): This is the vast majority of our
-		// disposes. Instead of constantly disposing, keep a
-		// per-P gcWork cache (probably combined with the
-		// write barrier wbuf cache).
-		gcw.dispose()
+		// No need to dispose since we're not in mark termination.

 		// If this is the last worker and we ran out of work,
 		// signal a completion point.
@@ -315,21 +313,24 @@ func scanstack(gp *g) {
 		throw("can't scan gchelper stack")
 	}

-	var gcw gcWork
-	gcw.initFromCache()
+	gcw := &getg().m.p.ptr().gcw
+	origBytesMarked := gcw.bytesMarked
+	origScanWork := gcw.scanWork
 	scanframe := func(frame *stkframe, unused unsafe.Pointer) bool {
 		// Pick up gcw as free variable so gentraceback and friends can
 		// keep the same signature.
-		scanframeworker(frame, unused, &gcw)
+		scanframeworker(frame, unused, gcw)
 		return true
 	}
 	gentraceback(^uintptr(0), ^uintptr(0), 0, gp, 0, nil, 0x7fffffff, scanframe, nil, 0)
 	tracebackdefers(gp, scanframe, nil)
 	// Stacks aren't part of the heap, so don't count them toward
 	// marked heap bytes.
-	gcw.bytesMarked = 0
-	gcw.scanWork = 0
-	gcw.disposeToCache()
+	gcw.bytesMarked = origBytesMarked
+	gcw.scanWork = origScanWork
+	if gcphase == _GCmarktermination {
+		gcw.dispose()
+	}
 	gp.gcscanvalid = true
 }

@@ -462,7 +463,6 @@ func gcDrain(gcw *gcWork, flushScanCredit int64) {
 		credit := gcw.scanWork - lastScanFlush
 		xaddint64(&gcController.bgScanCredit, credit)
 	}
-	checknocurrentwbuf()
 }

 // gcDrainUntilPreempt blackens grey objects until g.preempt is set.
@@ -524,7 +524,6 @@ func gcDrainUntilPreempt(gcw *gcWork, flushScanCredit int64) {
 // scanning is always done in whole object increments.
 //go:nowritebarrier
 func gcDrainN(gcw *gcWork, scanWork int64) {
-	checknocurrentwbuf()
 	targetScanWork := gcw.scanWork + scanWork
 	for gcw.scanWork < targetScanWork {
 		// This might be a good place to add prefetch code...
@@ -646,33 +645,16 @@ func scanobject(b, n uintptr, ptrmask *uint8, gcw *gcWork) {

 // Shade the object if it isn't already.
 // The object is not nil and known to be in the heap.
+// Preemption must be disabled.
 //go:nowritebarrier
 func shade(b uintptr) {
 	if obj, hbits, span := heapBitsForObject(b); obj != 0 {
-		// TODO: this would be a great place to put a check to see
-		// if we are harvesting and if we are then we should
-		// figure out why there is a call to shade when the
-		// harvester thinks we are in a STW.
-		// if atomicload(&harvestingwbufs) == uint32(1) {
-		// 	// Throw here to discover write barriers
-		// 	// being executed during a STW.
-		// 	throw("shade during harvest")
-		// }
-
-		var gcw gcWork
-		greyobject(obj, 0, 0, hbits, span, &gcw)
-		// This is part of the write barrier so put the wbuf back.
+		gcw := &getg().m.p.ptr().gcw
+		greyobject(obj, 0, 0, hbits, span, gcw)
 		if gcphase == _GCmarktermination {
+			// Ps aren't allowed to cache work during mark
+			// termination.
 			gcw.dispose()
-		} else {
-			// If we added any pointers to the gcw, then
-			// currentwbuf must be nil because 1)
-			// greyobject got its wbuf from currentwbuf
-			// and 2) shade runs on the systemstack, so
-			// we're still on the same M. If either of
-			// these becomes no longer true, we need to
-			// rethink this.
-			gcw.disposeToCache()
-		}
 	}
 }