cmd/compile: slice update with self subslice - write barrier probably not needed (?) #24314
Commit message of b854339 (CL 98757):

While running make.bash, over 5% of all pointer writes come from encoding/binary doing struct reads. This change replaces slicing during such reads with an offset. This avoids updating the slice pointer with every struct field read or write. This has no impact when the write barrier is off. Running the benchmarks with GOGC=1, however, shows significant improvement:

name          old time/op    new time/op    delta
ReadStruct-8    13.2µs ± 6%    10.1µs ± 5%  -23.24%  (p=0.000 n=10+10)

name          old speed      new speed      delta
ReadStruct-8  5.69MB/s ± 6%  7.40MB/s ± 5%  +30.18%  (p=0.000 n=10+10)

Change-Id: I22904263196bfeddc38abe8989428e263aee5253
Reviewed-on: https://go-review.googlesource.com/98757
Run-TryBot: Josh Bleecher Snyder <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Ian Lance Taylor <[email protected]>
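For illustration, a minimal sketch of the transformation the commit performs (the types and method names here are simplified stand-ins, not the actual encoding/binary code):

package sketch

// Before: each field read re-slices, updating the slice's data pointer
// and therefore issuing a write barrier whenever the GC is active.
type decoder struct{ buf []byte }

func (d *decoder) uint8() uint8 {
	x := d.buf[0]
	d.buf = d.buf[1:] // slice-pointer update -> write barrier
	return x
}

// After: the slice header is left untouched; only a plain integer
// offset advances, so no write barrier is needed.
type decoder2 struct {
	buf    []byte
	offset int
}

func (d *decoder2) uint8() uint8 {
	x := d.buf[d.offset]
	d.offset++ // integer update - no write barrier
	return x
}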
@aclements, would you mind providing your feedback here? If there is a chance write barriers could be avoided on slice updates, it would be a pity to miss 1.11, because even though Go is compiled, performance is sometimes too painful, especially when GC bits are intermixed. Thanks beforehand,
@navytux, good call. You're right that we can do this optimization now. It wasn't safe the last time we considered it, but the barrier has changed since then. Consider two goroutines: goroutine A does

A1: load buf.ptr
A2: store buf.ptr (a pointer into the same allocation)

goroutine B does

B1: buf = x:
B2:   load buf.ptr
B3:   shade(buf.ptr)  (the just-loaded old value)
B4:   shade(x.ptr)
B5:   store buf.ptr = x.ptr

(Ignoring len and cap, which a race could obviously shear, but that's an orthogonal problem.) Since the reads commute with each other and everything commutes with shade, the only orderings that matter are A1/B5, A2/B2, and A2/B5. This leads to four schedules worth considering:

1. A2 before B2: A runs entirely before B; the final buf.ptr is x.ptr, which B4 shades.
2. B2 before A2, A2 before B5: B5 overwrites A's store; the final buf.ptr is again x.ptr, shaded by B4.
3. A1 before B5, B5 before A2: A's store wins, so the final buf.ptr points into the original buf allocation while the GC saw B publish x; this is safe only because B3 shaded the original allocation (whatever interior pointer B2 loaded resolves to it).
4. B5 before A1: A re-slices x, whose pointer B4 already shaded.

Prior to the hybrid barrier, we didn't have step B3, which is what makes the third schedule above safe.
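In Go terms, the racing updates above correspond to something like the following sketch (buf stands for any shared slice variable; the names are illustrative):

package sketch

var buf []byte

// goroutine A: self-subslice update; under the proposed rule the
// compiler would omit the write barrier here (steps A1, A2).
func updateA() {
	buf = buf[1:]
}

// goroutine B: replaces buf with a different slice; compiled with the
// hybrid write barrier (steps B2-B5: shade the old value, shade the
// new value, then store).
func updateB(x []byte) {
	buf = x
}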
Change https://golang.org/cl/105258 mentions this issue.
@aclements thanks a lot for the feedback and for coming up with the patch. I'm glad my observation could be useful. @josharian thanks also for triggering this whole topic with b854339.
Status update: CL 105258 started to implement the idea, but along the way it also decided to generalize the WB-relaxation rule so that anything could happen between the corresponding load and store in the same goroutine, including other writes to the pointer in question. That generalization, however, brought correctness issues, because the race problem can now be e.g. a 3-body problem of 2 goroutines and the GC (see the example and discussion at https://go-review.googlesource.com/c/go/+/105258/3/src/cmd/compile/internal/ssa/writebarrier.go#49). I do not have an exact proof, but if we allow omitting the write barrier only when the store immediately writes back a pointer derived from the value just loaded from the same slot - i.e. a pointer still pointing into the same allocated object, with no other writes in between - there won't be situations where intermediate buf.ptr instability could leak to another goroutine without being eventually marked, like the G1 -> G2 leak of B in that example.
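To illustrate the distinction between the narrow rule and the generalization (my sketch; the exact problematic example is in the CL discussion linked above):

package sketch

type S struct{ v []byte }

// Narrow rule: the load and the store form a single self-subslice
// update with nothing in between; the stored pointer provably stays
// inside the same allocated object, so the barrier can be omitted.
func narrow(s *S) {
	s.v = s.v[1:]
}

// Generalization (problematic): another store happens between the load
// and the write-back; a second goroutine (G2) may pick up the
// intermediate value B while this goroutine (G1) never shades it.
func generalized(s *S, B []byte) {
	p := s.v    // load
	s.v = B     // intermediate write that G2 may observe
	s.v = p[1:] // write-back into the original allocation
}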
@aclements I understand we are in code freeze now, but CL 105258 was mailed before the freeze, so it would be a pity not to get it finished for Go 1.11. As I already said, even though Go is compiled, performance is sometimes too painful, especially when GC bits are intermixed, and especially taking into account that the slice self-update idiom is very common. It would be a pity to miss relaxing write barriers for this common case... I apologize if I am missing something and this optimization still cannot be applied even with my corrected rule. Thanks beforehand, /cc @cherrymui
Silence... "words are very unnecessary - they can only do harm..."
This is still the case today:

package issue24314
type Sumit struct {
v []int
}
// SumAndFlush drains s.v, re-slicing it on every iteration.
func (s *Sumit) SumAndFlush() int {
r := 0
for len(s.v) > 0 {
r += s.v[0]
s.v = s.v[1:] // <-- NOTE HERE
}
return r
}

which compiles to the following amd64 assembly (note the write-barrier check and the call to runtime.gcWriteBarrierDX):

#include "funcdata.h"
TEXT ·(*Sumit).SumAndFlush(SB), ABIInternal, $8-16 // bufself.go:7
NO_LOCAL_POINTERS
pc0:
// MOVQ (TLS), CX (stack growth prologue)
// CMPQ SP, 16(CX)
// PCDATA $0, $-2
// JLS 128
// PCDATA $0, $-1
// SUBQ $8, SP
// MOVQ BP, (SP) (BP save)
// LEAQ (SP), BP (BP init)
// FUNCDATA $0, gclocals·a36216b97439c93dafebe03e7f0808b5(SB) (args)
// FUNCDATA $1, gclocals·33cdeccccebe80329f1fdbee7f5874cb(SB) (no locals)
MOVQ s+16(SP), AX // bufself.go:9
XORL CX, CX
pc34:
MOVQ 8(AX), DX
MOVQ (AX), BX
MOVQ 16(AX), SI
TESTQ DX, DX
JLE pc109
MOVQ (BX), R8 // bufself.go:10
DECQ DX // bufself.go:11
MOVQ DX, 8(AX)
LEAQ -1(SI), DX
MOVQ DX, 16(AX)
ADDQ R8, CX // bufself.go:10
NEGQ DX // bufself.go:11
SARQ $63, DX
ANDQ $8, DX
ADDQ BX, DX
PCDATA $0, $-2
CMPL runtime.writeBarrier(SB), $0 // <-- NOTE HERE
JNE pc99
MOVQ DX, (AX)
JMP pc34
pc99:
MOVQ AX, DI
CALL runtime.gcWriteBarrierDX(SB) // <-- NOTE HERE
JMP pc34
pc109:
PCDATA $0, $-1 // bufself.go:13
MOVQ CX, _r0+24(SP)
// MOVQ (SP), BP (BP restore)
// ADDQ $8, SP (SP restore)
RET
NOP
PCDATA $1, $-1 // bufself.go:7
PCDATA $0, $-2
NOP
CALL runtime.morestack_noctxt(SB)
PCDATA $0, $-1
JMP pc0
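For reference, a listing like the above can be regenerated from the compiler's assembly output, e.g. (assuming the snippet is saved as bufself.go, which matches the file references in the listing):

go tool compile -S bufself.go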
Commit b854339 (CL 98757) changes encoding/binary to use an explicit offset instead of the

buf = buf[1:]

idiom, with the rationale that buf = buf[1:] is a slice data-pointer update, which in turn is done with a write barrier, while indexing via buf[offset] explicitly does not need to update the buf slice pointer and so does not need to issue a GC write barrier. The commit cites a significant speedup for encoding/binary under GOGC=1 and notes that "While running make.bash, over 5% of all pointer writes come from encoding/binary doing struct reads."

It was then noted there that instead of explicitly using an offset and avoiding the buf = buf[1:] update idiom for performance reasons, it would be better to teach the compiler that slice updates whose destination array pointer points to the same underlying allocated object as the original slice's, e.g. as in buf = buf[1:], do not need a write barrier at all.

The conversation there contains the rationale for the current compiler behaviour, but also brings questions and arguments for why the optimization can be done. It was suggested to move the conversation to the issue tracker, so here it goes.
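To make the proposal concrete, here is a sketch of which writes would and would not qualify (a hypothetical function, not code from the CL):

func update(s *[]int, t []int) {
	*s = (*s)[1:] // destination pointer stays inside the same allocation:
	              // the proposed rule would omit the write barrier
	*s = t        // may publish a pointer to a different allocation:
	              // still needs the write barrier
}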
@josharian thinking out loud: maybe it is better to instead teach the compiler to know that slice updates with the destination array pointer pointing to the same array's underlying object as the original slice's, e.g. as in

buf = buf[1:]

do not need a write barrier at all?
The rationale (to my limited write-barriers understanding) is that it is the whole allocated object that is shaded by the write barrier:

https://github.com/golang/go/blob/d7eb4901/src/runtime/mbarrier.go#L21
https://github.com/golang/go/blob/d7eb4901/src/runtime/mgcmark.go#L1211
https://github.com/golang/go/blob/d7eb4901/src/runtime/mbitmap.go#L354

and so for

buf = buf[1:]

the slice, either before or after the update, will be pointing inside the same allocated object -> thus the mutator for sure won't hide the allocated object from the GC, and so there is no need to shade it on such updates.

Doing so in the compiler will fix the performance issue this patch solves, as well as automatically remove write barriers in other, probably many, places, thus helping not only speed but also code size (#6853).
I apologize if there is something trivial preventing this that I missed.
Kirill
/cc @aclements
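The property being relied on here, per the runtime sources linked above, is that shading resolves any interior pointer to the base of its allocation before greying it, so buf.ptr before and after the re-slice greys the same object. A toy model of that resolution (not the runtime's actual code or data structures):

package sketch

type object struct{ start, end uintptr }

var heap []object // stand-in for the runtime's span metadata

// findObjectBase returns the base address of the allocation containing
// p, or 0 if p does not point into a tracked allocation. Any interior
// pointer of an object resolves to the same base, which is what shade
// ultimately greys.
func findObjectBase(p uintptr) uintptr {
	for _, o := range heap {
		if o.start <= p && p < o.end {
			return o.start
		}
	}
	return 0
}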
@aclements says:
The danger here is subtle. We haven't done this because the update to buf could be racy, and if we did this, that race could break the GC. Specifically, if one goroutine executes buf = buf[1:] without a write barrier while another executes buf = x with a write barrier, it's possible for the final value of buf to continue pointing to the original buf allocation, but for the garbage collector to think it points to the x allocation and free the original buf allocation.

Of course, racy Go programs are undefined, but right now the garbage collector is immune to these sorts of non-type-safety-breaking "benign" races (this is obviously a somewhat fuzzy argument, since races let you break type safety and that lets you break memory safety, and saying anything formal about "benign" races is notoriously impossible, so what I'm saying is probably not true anyway, but we try :)
@dgryski says:
@aclements It would be interesting to see the effect such an update would have on code size and performance to know how much we're leaving on the table with this safety check on.
me says:
@aclements thanks for explaining. Am I right that, with write barriers enabled for both racy buf updates, the mechanism that prevents the GC from deallocating the wrong object is that in the current GC the written ptr is always shaded too (https://github.com/golang/go/blob/go1.10-1-g678dede7bc/src/runtime/mbarrier.go#L162 (go1.10), https://github.com/golang/go/blob/0add9a4dcf/src/runtime/mwbbuf.go#L226 (tip))?

By the way, the thread that performs buf = x with a write barrier will always do shade(*slot). If there is buf = buf[1:] racing with buf = x, even if the former does not use a write barrier, the shade(*slot) of the buf = x thread will shade the original buf object in any case (it sees either buf or buf+1 as *slot, but both actually lead to the same allocated object). If so, since shade actually greys an object, and grey objects never become white - only eventually black - we can say that the original buf underlying object will stay alive, i.e. not be deallocated, and so it is safe to do buf = buf[1:] without a write barrier.

But this goes contrary to what you say, so there is probably some mistake on my side. Could you please explain where the above logic fails? I'm a GC newbie, so please forgive me if I miss something well-known.
@josharian says:
@navytux would you mind opening a new issue and migrating this (interesting) discussion there? We tend to try to keep all conversation on CLs and issues. Thanks!