-
Notifications
You must be signed in to change notification settings - Fork 18k
cmd/compile: Fannkuch11 can be speed up by ~ 30% via improving BCE #24660
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Fannkuch11 begins with a boundcheck that I thought was a low-hanging fruit: func s1(n int) []int {
s := make([]int, n)
for i := 0; i < n; i++ {
s[i] = i
}
return s
} I thought that proving @randall77 @mdempsky @aclements any suggestion on how to handle this? Is a new intermediate SSA op |
Some of our rewrite rules look into the load/store/call ops to extract useful information. see cmd/compile/internal/ssa/gen/generic.rules, particularly those for calling I think adding a SliceAlloc op might be better long term, but it is a bigger change. |
/cc @zdjones |
Change https://golang.org/cl/190657 mentions this issue: |
@navytux By my count, The CL above removes 6 of 12 bounds checks from Fannkuch. |
@zdjones, thanks. It is good to see that BCE is being worked on. |
Some BCEs in Funnkuch11 are possibly too hard, like those that require tracking that all values within a slice have a certain upper bound (since they are effectively slices of indices to another slice). @zdjones which are the ones left after your patch? |
I've marked the 12 remaining checks and noted which ones the patch removes.
|
Change https://golang.org/cl/195220 mentions this issue: |
Thanks! I noticed that this loop:
wasn't being detected by loopbce, and the fix was easy enough (just mailed). CL 195220 detects If we fix that, and manage to merge your trace CL series, I think we're basically done with Fannkuch, as tracking the contents of a slice is probably outside of what we can reasonably implement, and is possibly useful only for algorithms storing indexes to arrays within other arrays, which doesn't sound common enough to be worth design and resolving this problem. |
Awesome! I'm going to take a closer look at your CL, and that second bounds check, when I have the time. I suspect that CL will remove a decent number bounds checks from std/cmd as well? I think we may be able to get this one as well:
I'll investigate. My suspicion is that this is also a Phi issue. My plan is to try adding Phi to trace... maybe that will be enough. I agree about the remaining 3 checks; |
I actually have not had time to check.
The problem with this one is that the current induction variable logic can't handle a loop whose bounding condition is If trace grows to be able to handle this case, then it probably can substitute the whole induction variable detection code. |
So, the problem there is that, in prove, we don't update ft.limits when we record relations between dependencies not involving constants. For instance if learn that Eventually, we should get rid of |
Change https://golang.org/cl/196784 mentions this issue: |
Now that OpSliceMake is called by runtime.makeslice callers, prove can see and record the actual length and cap of each slice being constructed. This small patch is enough to remove 260 additional bound checks from cmd+std. Thanks to Martin Möhrmann for pointing me to CL141822 that I had missed. Updates #24660 Change-Id: I14556850f285392051f3f07d13b456b608b64eb9 Reviewed-on: https://go-review.googlesource.com/c/go/+/196784 Run-TryBot: Giovanni Bajo <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: David Chase <[email protected]>
prove wasn't able to detect induction variables that was bound by another inducation variable. This happened because an indvar is a Phi, and thus in case of a dependency, the loop bounding condition looked as Phi < Phi. This triggered an existing codepath that checked whether the upper bound was a Phi to detect loop conditions written in reversed order respect to the idiomatic way (eg: for i:=0; len(n)>i; i++). To fix this, we call the indvar pattern matching on both operands of the loop condition, so that the first operand that matches will be treated as the indvar. Updates #24660 (removes a boundcheck from Fannkuch) Change-Id: Iade83d8deb54f14277ed3f2e37b190e1ed173d11 Reviewed-on: https://go-review.googlesource.com/c/go/+/195220 Reviewed-by: David Chase <[email protected]>
After the two CLs that just landed, I think we're only missing the check analyzed here among those that are remotely feasible. I'm still experimenting with bound calculations in poset, but it's getting hairy. Not sure if I'll manage to land it, it might not be worth it, depending on how good it turns out to be when it's done. If I don't land it, I have another CL that improve ft.limits a little bit, enough to tackle this case and close this bug for good. |
This is spin-off issue from #20393. There in the discussion it was shown that Fannkuch11 benchmark from go1 benchmark suite runs ~30% faster with
-gcflags=-B
. A proof was also provided showing that all the bounds checks emitted by the compiler are actually not needed, and thus with clever BCE it should be possible to reach ~30% Fannkuch11 speedup level demonstrated by-gcflags=-B
.That discussion is from May 2017. However I've just rechecked with today's go tip
and current state is:
in other words there is still ~30% room for BCE related improvement.
Below comes verbatim copy of my original proof that all bounds checks are not actually needed in Fannkuch case. Even though it discusses code emitted by old compilers, all points about bounds checks not needed should stand still valid today.
Thanks,
Kirill
/cc @josharian, @rasky, @dr2chase, @brtzsnr, @aclements
---- 8< ----
Josh thanks for your feedback and for trying the compiler with -B. About fannkuch I've tried to look inside to see details about bounds checking. For this I've compiled
fannkuch_test.go
without and with -B, didgo tool objdump -S
on both, removed addresses of instruction so that the diff is less noisy, and investigated the difference. Let me show it below with my annotations and comments:BCE could remove ^^^ panicindex because:
perm1
is initialized withmake([]int, n)
and its len is never changedso BCE knows n <= len(perm1) (actually ==), 0 <= i < n -> i < len(perm1) -> no BC needed.
For
count[r-1] = r
BCE could remove ^^^ panicindex because:count
is initialized withmake([]int, n)
and its len is never changedr
is initialized asn
, goes down to 1 in below loop, goes up in tail loop inside main loopfor ; r < n; r++
, so it is never > n. => r ∈ [1, n] always.so BCE knows
1 <= r <= n
0 <= r-1 <= n-1 < n
so
count[r-1]
does not need BC.len(perm1)
is known to ben
-> this way perm1[0] does not need BC.
n1
is initialized asn - 1
and is never changedlen(perm1)
is once again known to ben
-> this way
perm1[n1]
does not need BCperm
andperm1
were initialized asmake([]int, n)
and their len is never changed-> both
perm[i]
andperm1[i]
do not need BCk is initialized as
perm1[0]
- no BC needed here since compiler already knows len(perm1) = n >= 1initially perm1 is initialized as
perm1[i] = i
i ∈ [0, n)from ^^^ the compiler knows initially ∀ x ∈ [0, n) perm1[x] ∈ [0, n)
later elements of perm1 are only exchanged with each other, so the compiler can prove ∀ x ∈ [0, n) perm1[x] ∈ [0, n) holds always
thus k itself can be proven to be in [0, n) on first iteration (see also below about all next iterations)
the inner loop is
-> both i and j are in valid range and no BC is needed for
perm[i]
andperm[j]
perm[k]
does not need BCperm
is initialized as =perm1
- this way compiler can know ∀ x ∈ [0, n)perm[x]
∈ [0, n) initiallyj := perm[k]
j
is thus ∈ [0, n)perm[k] = k
preserves ∀ x ∈ [0, n)perm[x]
∈ [0, n) property.k = j
compiler can see k continues to k ∈ [0, n) in next iterationsfor ; r < n; r++
for i := 0; i < r; i++
perm1[i]
andperm1[i+1]
does not need BCperm1[i] = perm1[i+1]
the compiler can see∀ x ∈ [0, n) perm1[x] ∈ [0, n)
property is preservedr
is initially set ton
thus after this we know (at least on first mainloop iteration, see below about next)
r = 1
furtnerthe outer loop, once again, is
for ; r < n; r++
thus inside r ∈ [1, n)
thus
perm1[r]
andcount[r]
do not need BCthe outer loop only increases r which was initially known to be = 1, so after it is known to be >= 1
this way when main loop runs next time in
the compiler can prove, since
r >= 1
the loop will terminate with r==1, not loop forever^^^ are all
CALL runtime.panicindex
destinations and they were all covered by my comments.So maybe I missed something (appologize then) but above I was able to prove all runtime bounds check are unneccessary in fannkuch. I don't see any reason the compiler cannot be taught to do the same. Thus for fannkuch it is theoretically possible to reach -B level with just BCE.
The text was updated successfully, but these errors were encountered: