cmd/compile: Fannkuch11 can be speed up by ~ 30% via improving BCE #24660

navytux · 2018-04-03T10:21:43Z

This is spin-off issue from #20393. There in the discussion it was shown that Fannkuch11 benchmark from go1 benchmark suite runs ~30% faster with -gcflags=-B. A proof was also provided showing that all the bounds checks emitted by the compiler are actually not needed, and thus with clever BCE it should be possible to reach ~30% Fannkuch11 speedup level demonstrated by -gcflags=-B.

That discussion is from May 2017. However I've just rechecked with today's go tip

$ gotip version
go version devel +26e0e8a840 Tue Apr 3 05:45:44 2018 +0000 linux/amd64

and current state is:

.../gotip/test/bench/go1$ gotip test -bench Fannkuch -count 10 >ftip.txt
.../gotip/test/bench/go1$ gotip test -gcflags=-B -bench Fannkuch -count 10 >ftipB.txt

.../gotip/test/bench/go1$ benchstat ftip.txt ftipB.txt 
name          old time/op  new time/op  delta
Fannkuch11-4   2.70s ± 1%   1.96s ± 0%  -27.42%  (p=0.000 n=10+10)

in other words there is still ~30% room for BCE related improvement.

Below comes verbatim copy of my original proof that all bounds checks are not actually needed in Fannkuch case. Even though it discusses code emitted by old compilers, all points about bounds checks not needed should stand still valid today.

Thanks,
Kirill

/cc @josharian, @rasky, @dr2chase, @brtzsnr, @aclements

---- 8< ----

Josh thanks for your feedback and for trying the compiler with -B. About fannkuch I've tried to look inside to see details about bounds checking. For this I've compiled fannkuch_test.go without and with -B, did go tool objdump -S on both, removed addresses of instruction so that the diff is less noisy, and investigated the difference. Let me show it below with my annotations and comments:

@@ -2,259 +2,172 @@ TEXT %22%22.fannkuch(SB) gofile..$GOROOT/test/bench/go1/fannkuch_test.go
 func fannkuch(n int) int {
                        64488b0c2500000000      MOVQ FS:0, CX           [5:9]R_TLS_LE
                        483b6110                CMPQ 0x10(CX), SP
-                       0f8623030000            JBE 0x2508              
-                       4881ec80000000          SUBQ $0x80, SP          
-                       48896c2478              MOVQ BP, 0x78(SP)       
-                       488d6c2478              LEAQ 0x78(SP), BP       
-                       488b842488000000        MOVQ 0x88(SP), AX       
+                       0f86d5010000            JBE 0x235a              
+                       4883ec48                SUBQ $0x48, SP          
+                       48896c2440              MOVQ BP, 0x40(SP)       
+                       488d6c2440              LEAQ 0x40(SP), BP       
+                       488b442450              MOVQ 0x50(SP), AX       
        if n < 1 {
                        4883f801                CMPQ $0x1, AX
-                       0f8ca8020000            JL 0x24b0               
+                       0f8ca5010000            JL 0x2347               
                        488d0d00000000          LEAQ 0(IP), CX          [3:7]R_PCREL:type.int
        perm := make([]int, n)
                        48890c24                MOVQ CX, 0(SP)
                        4889442408              MOVQ AX, 0x8(SP)
                        4889442410              MOVQ AX, 0x10(SP)
-                       e800000000              CALL 0x2222             [1:5]R_CALL:runtime.makeslice
-                       488b442420              MOVQ 0x20(SP), AX       
+                       e800000000              CALL 0x21bc             [1:5]R_CALL:runtime.makeslice
+                       488b442418              MOVQ 0x18(SP), AX       
                        4889442438              MOVQ AX, 0x38(SP)
-                       488b4c2418              MOVQ 0x18(SP), CX       
-                       48894c2468              MOVQ CX, 0x68(SP)       
-                       488d1500000000          LEAQ 0(IP), DX          [3:7]R_PCREL:type.int
+                       488d0d00000000          LEAQ 0(IP), CX          [3:7]R_PCREL:type.int
        perm1 := make([]int, n)
-                       48891424                MOVQ DX, 0(SP)          
-                       488b9c2488000000        MOVQ 0x88(SP), BX       
-                       48895c2408              MOVQ BX, 0x8(SP)        
-                       48895c2410              MOVQ BX, 0x10(SP)       
-                       e800000000              CALL 0x2258             [1:5]R_CALL:runtime.makeslice
-                       488b442418              MOVQ 0x18(SP), AX       
-                       4889442460              MOVQ AX, 0x60(SP)       
-                       488b4c2420              MOVQ 0x20(SP), CX       
-                       48894c2430              MOVQ CX, 0x30(SP)       
-                       488d1500000000          LEAQ 0(IP), DX          [3:7]R_PCREL:type.int
-       count := make([]int, n)
-                       48891424                MOVQ DX, 0(SP)          
-                       488b942488000000        MOVQ 0x88(SP), DX       
+                       48890c24                MOVQ CX, 0(SP)          
+                       488b542450              MOVQ 0x50(SP), DX       
                        4889542408              MOVQ DX, 0x8(SP)
                        4889542410              MOVQ DX, 0x10(SP)
-                       e800000000              CALL 0x228e             [1:5]R_CALL:runtime.makeslice
-                       488b442420              MOVQ 0x20(SP), AX       
-                       488b4c2418              MOVQ 0x18(SP), CX       
-                       488b942488000000        MOVQ 0x88(SP), DX       
-                       488b5c2460              MOVQ 0x60(SP), BX       
-                       488b742430              MOVQ 0x30(SP), SI       
-                       31ff                    XORL DI, DI             
+                       e800000000              CALL 0x21e5             [1:5]R_CALL:runtime.makeslice
+                       488b442418              MOVQ 0x18(SP), AX       
+                       4889442430              MOVQ AX, 0x30(SP)       
+                       488d0d00000000          LEAQ 0(IP), CX          [3:7]R_PCREL:type.int
+       count := make([]int, n)
+                       48890c24                MOVQ CX, 0(SP)          
+                       488b4c2450              MOVQ 0x50(SP), CX       
+                       48894c2408              MOVQ CX, 0x8(SP)        
+                       48894c2410              MOVQ CX, 0x10(SP)       
+                       e800000000              CALL 0x220e             [1:5]R_CALL:runtime.makeslice
+                       488b442418              MOVQ 0x18(SP), AX       
+                       488b4c2450              MOVQ 0x50(SP), CX       
+                       488b542430              MOVQ 0x30(SP), DX       
+                       31db                    XORL BX, BX             
        for i := 0; i < n; i++ {
-                       eb07                    JMP 0x22b5              
+                       eb07                    JMP 0x2228              
                perm1[i] = i // initial (trivial) permutation
-                       48893cfb                MOVQ DI, 0(BX)(DI*8)    
+                       48891cda                MOVQ BX, 0(DX)(BX*8)    
        for i := 0; i < n; i++ {
-                       48ffc7                  INCQ DI                 
-                       4839d7                  CMPQ DX, DI             
-                       7d0a                    JGE 0x22c4              
-               perm1[i] = i // initial (trivial) permutation
-                       4839f7                  CMPQ SI, DI             
-                       72ef                    JB 0x22ae               
-                       e93d020000              JMP 0x2501                      // panicindex
-                       48894c2470              MOVQ CX, 0x70(SP)       
+                       48ffc3                  INCQ BX                 
+                       4839cb                  CMPQ CX, BX             
+                       7cf4                    JL 0x2221

BCE could remove ^^^ panicindex because:

perm1 is initialized with make([]int, n) and its len is never changed
then it is initialized in a loop

        for i := 0; i < n; i++ {
                perm1[i] = i // initial (trivial) permutation
        }

so BCE knows n <= len(perm1) (actually ==), 0 <= i < n -> i < len(perm1) -> no BC needed.

        n1 := n - 1
-                       488d7aff                LEAQ -0x1(DX), DI       
-                       48897c2440              MOVQ DI, 0x40(SP)       
-                       4989d0                  MOVQ DX, R8             
-                       4531c9                  XORL R9, R9             
-                       4531d2                  XORL R10, R10           
+                       488d59ff                LEAQ -0x1(CX), BX       
+                       4889ce                  MOVQ CX, SI             
+                       31ff                    XORL DI, DI             
+                       4531c0                  XORL R8, R8             
        for {
-                       e986010000              JMP 0x2466              
+                       e9cf000000              JMP 0x230d              
                        count[r-1] = r
-                       488954d1f8              MOVQ DX, -0x8(CX)(DX*8) 
-                       4c89da                  MOVQ R11, DX            
+                       48894cc8f8              MOVQ CX, -0x8(AX)(CX*8) 
+                       48ffc9                  DECQ CX                 
                for ; r != 1; r-- {
-                       4883fa01                CMPQ $0x1, DX           
-                       740e                    JE 0x22fc               
-                       count[r-1] = r
-                       4c8d5aff                LEAQ -0x1(DX), R11      
-                       4939c3                  CMPQ AX, R11            
-                       72e9                    JB 0x22e0               
-                       e9fe010000              JMP 0x24fa                      // panicindex
+                       4883f901                CMPQ $0x1, CX           
+                       75f2                    JNE 0x223e

For count[r-1] = r BCE could remove ^^^ panicindex because:

count is initialized with make([]int, n) and its len is never changed
r is initialized as n, goes down to 1 in below loop, goes up in tail loop inside main loop for ; r < n; r++, so it is never > n. => r ∈ [1, n] always.
then there is loop

        for ; r != 1; r-- {
                count[r-1] = r
        }

so BCE knows

1 <= r <= n
-> 0 <= r-1 <= n-1 < n

so count[r-1] does not need BC.

                if perm1[0] != 0 && perm1[n1] != n1 {
-                       4885f6                  TESTQ SI, SI                    
-                       0f86ee010000            JBE 0x24f3                      // panicindex
-                       4c8b1b                  MOVQ 0(BX), R11                 
-                       4d85db                  TESTQ R11, R11                  
-                       0f8490010000            JE 0x24a1                       
-                       4839f7                  CMPQ SI, DI                     
-                       0f83d9010000            JAE 0x24f3                      // panicindex
-                       4e8b5cc3f8              MOVQ -0x8(BX)(R8*8), R11        
-                       4c39df                  CMPQ R11, DI                    
-                       0f846a010000            JE 0x2492                       
-                       4c8b5c2468              MOVQ 0x68(SP), R11              
-                       4c8b642438              MOVQ 0x38(SP), R12              
-                       41bd01000000            MOVL $0x1, R13                  
+                       4c8b0a                  MOVQ 0(DX), R9          
+                       4d85c9                  TESTQ R9, R9            
+                       0f84e5000000            JE 0x233d               
+                       4c8b4cf2f8              MOVQ -0x8(DX)(SI*8), R9 
+                       4c39cb                  CMPQ R9, BX             
+                       0f84cd000000            JE 0x2333               
+                       4c8b4c2438              MOVQ 0x38(SP), R9       
+                       41ba01000000            MOVL $0x1, R10

len(perm1) is known to be n
n is known to be >= 1 because in the beginning of fannkuch there is

        if n < 1 {
                return 0
        }

-> this way perm1[0] does not need BC.

n1 is initialized as n - 1 and is never changed
thus it is known 0 <= n1 <= n -1 < n
len(perm1) is once again known to be n

-> this way perm1[n1] does not need BC

                        for i := 1; i < n; i++ { // perm = perm1
-                       eb07                    JMP 0x2341              
+                       eb0b                    JMP 0x227e              
                                perm[i] = perm1[i]
-                       4f8934eb                MOVQ R14, 0(R11)(R13*8) 
+                       4e8b1cd2                MOVQ 0(DX)(R10*8), R11  
+                       4f891cd1                MOVQ R11, 0(R9)(R10*8)  
                        for i := 1; i < n; i++ { // perm = perm1
-                       49ffc5                  INCQ R13                
-                       4d39c5                  CMPQ R8, R13            
-                       7d17                    JGE 0x235d              
-                               perm[i] = perm1[i]
-                       4939f5                  CMPQ SI, R13            
-                       0f839d010000            JAE 0x24ec                      // panicindex
-                       4e8b34eb                MOVQ 0(BX)(R13*8), R14  
-                       4d39e5                  CMPQ R12, R13           
-                       72e2                    JB 0x233a               
-                       e98f010000              JMP 0x24ec                      // panicindex
-                       4c894c2450              MOVQ R9, 0x50(SP)       
+                       49ffc2                  INCQ R10                
+                       4939f2                  CMPQ SI, R10            
+                       7cf0                    JL 0x2273

i ∈ [1, n)
both perm and perm1 were initialized as make([]int, n) and their len is never changed

-> both perm[i] and perm1[i] do not need BC

                        k := perm1[0] // cache perm[0] in k
-                       4c8b2b                  MOVQ 0(BX), R13         
-                       4531f6                  XORL R14, R14           
+                       4c8b12                  MOVQ 0(DX), R10         
+                       4531db                  XORL R11, R11           
                        for {         // k!=0 ==> k>0
-                       eb6d                    JMP 0x23d7              
+                       eb32                    JMP 0x22bd              
                                        perm[i], perm[j] = perm[j], perm[i]
-                       4b8b0cfb                MOVQ 0(R11)(R15*8), CX  
-                       49890cfb                MOVQ CX, 0(R11)(DI*8)   
-                       4f890cfb                MOVQ R9, 0(R11)(R15*8)  
+                       4f8b34e1                MOVQ 0(R9)(R12*8), R14  
+                       4f8b3ce9                MOVQ 0(R9)(R13*8), R15  
+                       4f8934e9                MOVQ R14, 0(R9)(R13*8)  
+                       4f893ce1                MOVQ R15, 0(R9)(R12*8)  
                                for i, j := 1, k-1; i < j; i, j = i+1, j-1 {
-                       488d4f01                LEAQ 0x1(DI), CX        
-                       49ffcf                  DECQ R15                
-                       488b7c2440              MOVQ 0x40(SP), DI       
-                       4c8b4c2450              MOVQ 0x50(SP), R9       
-                       48894c2448              MOVQ CX, 0x48(SP)       
-                       488b4c2470              MOVQ 0x70(SP), CX       
-                       488b7c2448              MOVQ 0x48(SP), DI       
-                       4c39ff                  CMPQ R15, DI            
-                       7d17                    JGE 0x23b2              
-                                       perm[i], perm[j] = perm[j], perm[i]
-                       4c39e7                  CMPQ R12, DI            
-                       0f8341010000            JAE 0x24e5                      // panicindex
-                       4d8b0cfb                MOVQ 0(R11)(DI*8), R9   
-                       4d39e7                  CMPQ R12, R15           
-                       72bd                    JB 0x236a               
-                       e933010000              JMP 0x24e5                      // panicindex

k is initialized as perm1[0] - no BC needed here since compiler already knows len(perm1) = n >= 1
initially perm1 is initialized as perm1[i] = i i ∈ [0, n)
from ^^^ the compiler knows initially ∀ x ∈ [0, n) perm1[x] ∈ [0, n)
later elements of perm1 are only exchanged with each other, so the compiler can prove ∀ x ∈ [0, n) perm1[x] ∈ [0, n) holds always
thus k itself can be proven to be in [0, n) on first iteration (see also below about all next iterations)
the inner loop is

        for i, j := 1, k-1; i < j; i, j = i+1, j-1 {
                perm[i], perm[j] = perm[j], perm[i]
        }

j starts from k-1 < n-1 and is only decreasing
i starts from 1 and is only increasing
the loop is till i < j
thus inside loop:
- 1 < j < n - 1
- 1 <= i < n - 1

-> both i and j are in valid range and no BC is needed for perm[i] and perm[j]

-                               j := perm[k]
+                       49ffc5                  INCQ R13                
+                       49ffcc                  DECQ R12                
                        4d39e5                  CMPQ R12, R13
-                       0f8323010000            JAE 0x24de                      // panicindex
-                       4b8b3ceb                MOVQ 0(R11)(R13*8), DI  
+                       7ce5                    JL 0x228b               
+                               j := perm[k]
+                       4f8b24d1                MOVQ 0(R9)(R10*8), R12

k exchange is

        // Now exchange k (caching perm[0]) and perm[k]... with care!
        j := perm[k]
        perm[k] = k
        k = j

we know initially (first iter - see above) k ∈ [0, n)
thus perm[k] does not need BC
before loop perm is initialized as = perm1 - this way compiler can know ∀ x ∈ [0, n) perm[x] ∈ [0, n) initially
in j := perm[k] j is thus ∈ [0, n)
perm[k] = k preserves ∀ x ∈ [0, n) perm[x] ∈ [0, n) property.
from k = j compiler can see k continues to k ∈ [0, n) in next iterations

                                perm[k] = k
-                       4f892ceb                MOVQ R13, 0(R11)(R13*8) 
+                       4f8914d1                MOVQ R10, 0(R9)(R10*8)  
                                flips++
-                       4d8d6e01                LEAQ 0x1(R14), R13      
+                       4d8d5301                LEAQ 0x1(R11), R10      
                                if k == 0 {
-                       4885ff                  TESTQ DI, DI            
-                       7425                    JE 0x23f1               
-                       4d89ee                  MOVQ R13, R14           
-                       4989fd                  MOVQ DI, R13            
-                       488b7c2440              MOVQ 0x40(SP), DI       
+                       4d85e4                  TESTQ R12, R12          
+                       7412                    JE 0x22c9               
+                       4d89d3                  MOVQ R10, R11           
+                       4d89e2                  MOVQ R12, R10           
                                for i, j := 1, k-1; i < j; i, j = i+1, j-1 {
-                       4d8d7dff                LEAQ -0x1(R13), R15     
-                       4889442458              MOVQ AX, 0x58(SP)       
-                       b801000000              MOVL $0x1, AX           
-                       4889442448              MOVQ AX, 0x48(SP)       
-                       488b442458              MOVQ 0x58(SP), AX       
-                       eba0                    JMP 0x2391              
+                       4d8d62ff                LEAQ -0x1(R10), R12     
+                       41bd01000000            MOVL $0x1, R13          
+                       ebd8                    JMP 0x22a1              
                        if flipsMax < flips {
-                       4d39ea                  CMPQ R13, R10           
-                       7c56                    JL 0x244c               
-                       e992000000              JMP 0x248d              
+                       4d39d0                  CMPQ R10, R8            
+                       7c2a                    JL 0x22f8               
+                       eb5e                    JMP 0x232e              
                                perm1[i] = perm1[i+1]
-                       4e893cd3                MOVQ R15, 0(BX)(R10*8)  
-                       4d89f2                  MOVQ R14, R10           
+                       4e8b64da08              MOVQ 0x8(DX)(R11*8), R12        
+                       4e8924da                MOVQ R12, 0(DX)(R11*8)          
+                       49ffc3                  INCQ R11                        
                        for i := 0; i < r; i++ {
-                       4939d2                  CMPQ DX, R10            
-                       7d1c                    JGE 0x2423              
-                               perm1[i] = perm1[i+1]
-                       4d8d7201                LEAQ 0x1(R10), R14              
-                       4939f6                  CMPQ SI, R14                    
-                       0f83c3000000            JAE 0x24d7                      // panicindex
-                       4e8b7cd308              MOVQ 0x8(BX)(R10*8), R15        
-                       4939f2                  CMPQ SI, R10                    
-                       72dd                    JB 0x23fb                       
-                       e9b4000000              JMP 0x24d7                      // panicindex
+                       4939cb                  CMPQ CX, R11            
+                       7cef                    JL 0x22d0

outer loop is for ; r < n; r++
inner loop is for i := 0; i < r; i++
inside loop r is thus < n, i.e. <= n-1
inside inner loop i is thus < r, i.e. < n-1, i.e. ∈ [0, n-1)
thus both perm1[i] and perm1[i+1] does not need BC
once again on perm1[i] = perm1[i+1] the compiler can see ∀ x ∈ [0, n) perm1[x] ∈ [0, n) property is preserved

                        perm1[r] = perm0
-                       4839f2                  CMPQ SI, DX             
-                       0f83a4000000            JAE 0x24d0              
-                       48893cd3                MOVQ DI, 0(BX)(DX*8)    
+                       4c8904ca                MOVQ R8, 0(DX)(CX*8)    
                        count[r]--
-                       4839c2                  CMPQ AX, DX             
-                       0f8390000000            JAE 0x24c9                      // panicindex
-                       488b3cd1                MOVQ 0(CX)(DX*8), DI    
-                       48ffcf                  DECQ DI                 
-                       48893cd1                MOVQ DI, 0(CX)(DX*8)    
+                       4c8b04c8                MOVQ 0(AX)(CX*8), R8    
+                       49ffc8                  DECQ R8                 
+                       4c8904c8                MOVQ R8, 0(AX)(CX*8)

outside main loop r is initially set to n
the main loop contains

        for ; r != 1; r-- {
                count[r-1] = r
        }

thus after this we know (at least on first mainloop iteration, see below about next) r = 1 furtner
the outer loop, once again, is for ; r < n; r++
thus inside r ∈ [1, n)
thus perm1[r] and count[r] do not need BC
the outer loop only increases r which was initially known to be = 1, so after it is known to be >= 1
this way when main loop runs next time in

        for ; r != 1; r-- {
                count[r-1] = r
        }

the compiler can prove, since r >= 1 the loop will terminate with r==1, not loop forever

                        if count[r] > 0 {
-                       4885ff                  TESTQ DI, DI            
-                       7f10                    JG 0x2459               
+                       4d85c0                  TESTQ R8, R8            
+                       7f10                    JG 0x2305               
                for ; r < n; r++ {
-                       48ffc2                  INCQ DX                 
-                       4c39c2                  CMPQ R8, DX             
-                       7d0b                    JGE 0x245c              
+                       48ffc1                  INCQ CX                 
+                       4839f1                  CMPQ SI, CX             
+                       7d0b                    JGE 0x2308              
                        perm0 := perm1[0]
-                       488b3b                  MOVQ 0(BX), DI          
-                       4531d2                  XORL R10, R10           
+                       4c8b02                  MOVQ 0(DX), R8          
+                       4531db                  XORL R11, R11           
                        for i := 0; i < r; i++ {
-                       eba9                    JMP 0x2402              
+                       ebd7                    JMP 0x22dc              
                for ; r < n; r++ {
-                       4c39c2                  CMPQ R8, DX             
+                       4839f1                  CMPQ SI, CX             
                if r == n {
-                       741a                    JE 0x2478               
-                       488b7c2440              MOVQ 0x40(SP), DI       
-                       4d89ea                  MOVQ R13, R10           
+                       7415                    JE 0x231f               
+                       4d89d0                  MOVQ R10, R8            
                if didpr < 30 {
-                       4983f91e                CMPQ $0x1e, R9          
-                       0f8d78feffff            JGE 0x22e8              
+                       4883ff1e                CMPQ $0x1e, DI          
+                       0f8d2fffffff            JGE 0x2246              
                        didpr++
-                       49ffc1                  INCQ R9                 
+                       48ffc7                  INCQ DI                 
                if didpr < 30 {
-                       e970feffff              JMP 0x22e8              
+                       e927ffffff              JMP 0x2246              
                        return flipsMax
-                       4c89ac2490000000        MOVQ R13, 0x90(SP)      
-                       488b6c2478              MOVQ 0x78(SP), BP       
-                       4881c480000000          ADDQ $0x80, SP          
+                       4c89542458              MOVQ R10, 0x58(SP)      
+                       488b6c2440              MOVQ 0x40(SP), BP       
+                       4883c448                ADDQ $0x48, SP          
                        c3                      RET                     
-                       4d89d5                  MOVQ R10, R13           
+                       4d89c2                  MOVQ R8, R10            
                        if flipsMax < flips {
-                       ebba                    JMP 0x244c              
-                       4c8b5c2468              MOVQ 0x68(SP), R11      
-                       4c8b642438              MOVQ 0x38(SP), R12      
-                       4d89d5                  MOVQ R10, R13           
+                       ebc5                    JMP 0x22f8              
+                       4c8b4c2438              MOVQ 0x38(SP), R9       
+                       4d89c2                  MOVQ R8, R10            
                if perm1[0] != 0 && perm1[n1] != n1 {
-                       ebab                    JMP 0x244c              
-                       4c8b5c2468              MOVQ 0x68(SP), R11      
-                       4c8b642438              MOVQ 0x38(SP), R12      
-                       4d89d5                  MOVQ R10, R13           
-                       eb9c                    JMP 0x244c              
+                       ebbb                    JMP 0x22f8              
+                       4c8b4c2438              MOVQ 0x38(SP), R9       
+                       4d89c2                  MOVQ R8, R10            
+                       ebb1                    JMP 0x22f8              
                return 0
-                       48c784249000000000000000        MOVQ $0x0, 0x90(SP)     
-                       488b6c2478                      MOVQ 0x78(SP), BP       
-                       4881c480000000                  ADDQ $0x80, SP          
-                       c3                              RET                     
-                       count[r]--
-  0x24c9               e800000000              CALL 0x24ce             [1:5]R_CALL:runtime.panicindex
-  0x24ce               0f0b                    UD2                     
-                       perm1[r] = perm0
-  0x24d0               e800000000              CALL 0x24d5             [1:5]R_CALL:runtime.panicindex
-  0x24d5               0f0b                    UD2                     
-                               perm1[i] = perm1[i+1]
-  0x24d7               e800000000              CALL 0x24dc             [1:5]R_CALL:runtime.panicindex
-  0x24dc               0f0b                    UD2                     
-                               j := perm[k]
-  0x24de               e800000000              CALL 0x24e3             [1:5]R_CALL:runtime.panicindex
-  0x24e3               0f0b                    UD2                     
-                                       perm[i], perm[j] = perm[j], perm[i]
-  0x24e5               e800000000              CALL 0x24ea             [1:5]R_CALL:runtime.panicindex
-  0x24ea               0f0b                    UD2                     
-                               perm[i] = perm1[i]
-  0x24ec               e800000000              CALL 0x24f1             [1:5]R_CALL:runtime.panicindex
-  0x24f1               0f0b                    UD2                     
-               if perm1[0] != 0 && perm1[n1] != n1 {
-  0x24f3               e800000000              CALL 0x24f8             [1:5]R_CALL:runtime.panicindex
-  0x24f8               0f0b                    UD2                     
-                       count[r-1] = r
-  0x24fa               e800000000              CALL 0x24ff             [1:5]R_CALL:runtime.panicindex
-  0x24ff               0f0b                    UD2                     
-               perm1[i] = i // initial (trivial) permutation
-  0x2501               e800000000              CALL 0x2506             [1:5]R_CALL:runtime.panicindex
-  0x2506               0f0b                    UD2                     
+                       48c744245800000000      MOVQ $0x0, 0x58(SP)     
+                       488b6c2440              MOVQ 0x40(SP), BP       
+                       4883c448                ADDQ $0x48, SP          
+                       c3                      RET                     
 func fannkuch(n int) int {
-                       e800000000              CALL 0x250d             [1:5]R_CALL:runtime.morestack_noctxt
-                       e9c0fcffff              JMP %22%22.fannkuch(SB) 
+                       e800000000              CALL 0x235f             [1:5]R_CALL:runtime.morestack_noctxt
+                       e90efeffff              JMP %22%22.fannkuch(SB)

^^^ are all CALL runtime.panicindex destinations and they were all covered by my comments.

So maybe I missed something (appologize then) but above I was able to prove all runtime bounds check are unneccessary in fannkuch. I don't see any reason the compiler cannot be taught to do the same. Thus for fannkuch it is theoretically possible to reach -B level with just BCE.

The text was updated successfully, but these errors were encountered:

rasky · 2018-04-03T12:42:53Z

Fannkuch11 begins with a boundcheck that I thought was a low-hanging fruit:

func s1(n int) []int {
       s := make([]int, n)
       for i := 0; i < n; i++ {
               s[i] = i
       }
       return s
}

I thought that proving len(s) == n was going to be easy, but it turns out that it's complex because there is no generic SSA op for slice allocation; walk lowers make to a runtime call even before SSA conversion, so by the time prove runs, it's already an interwinded serie of load/store/call ops that is hard to look through.

@randall77 @mdempsky @aclements any suggestion on how to handle this? Is a new intermediate SSA op SliceBuild really needed? I see there's already SliceMake but that is for the slicing operation itself, not for allocating a new slice.

randall77 · 2018-04-03T14:41:26Z

Some of our rewrite rules look into the load/store/call ops to extract useful information. see cmd/compile/internal/ssa/gen/generic.rules, particularly those for calling newobject and memmove. You might be able to rewrite len(make([]int, n)) -> n.

I think adding a SliceAlloc op might be better long term, but it is a bigger change.

navytux · 2019-02-14T12:17:29Z

/cc @zdjones

gopherbot · 2019-09-06T17:05:25Z

Change https://golang.org/cl/190657 mentions this issue: cmd/compile: introduce recursive, on-demand inference in prove

zdjones · 2019-09-13T10:32:45Z

@navytux By my count, The CL above removes 6 of 12 bounds checks from Fannkuch.

navytux · 2019-09-13T11:05:21Z

@zdjones, thanks. It is good to see that BCE is being worked on.

rasky · 2019-09-13T12:35:11Z

Some BCEs in Funnkuch11 are possibly too hard, like those that require tracking that all values within a slice have a certain upper bound (since they are effectively slices of indices to another slice). @zdjones which are the ones left after your patch?

zdjones · 2019-09-14T08:38:59Z

I've marked the 12 remaining checks and noted which ones the patch removes.

func fannkuch(n int) int {
	if n < 1 {
		return 0
	}

	n1 := n - 1
	perm := make([]int, n)
	perm1 := make([]int, n)
	count := make([]int, n)

	for i := 0; i < n; i++ {
		perm1[i] = i // initial (trivial) permutation :8 Found IsInBounds (removed by trace)
	}

	r := n
	didpr := 0
	flipsMax := 0
	for {
		if didpr < 30 {
			didpr++
		}
		for ; r != 1; r-- {
			count[r-1] = r // :9 Found IsInBounds
		}

		if perm1[0] != 0 && perm1[n1] != n1 { // :11, :28 Found IsInBounds (both removed by trace)
			flips := 0
			for i := 1; i < n; i++ { // perm = perm1
				perm[i] = perm1[i] // :9 Found IsInBounds (removed by trace)
			}
			k := perm1[0] // cache perm[0] in k
			for {         // k!=0 ==> k>0
				for i, j := 1, k-1; i < j; i, j = i+1, j-1 {
					perm[i], perm[j] = perm[j], perm[i] // :29, :38 Found IsInBounds
				}
				flips++
				// Now exchange k (caching perm[0]) and perm[k]... with care!
				j := perm[k] // :14 Found IsInBounds
				perm[k] = k
				k = j
				if k == 0 {
					break
				}
			}
			if flipsMax < flips {
				flipsMax = flips
			}
		}

		for ; r < n; r++ {
			// rotate down perm[0..r] by one
			perm0 := perm1[0]
			for i := 0; i < r; i++ {
				perm1[i] = perm1[i+1] // :10, :21 Found IsInBounds
			}
			perm1[r] = perm0 // :9 Found IsInBounds (removed by trace)
			count[r]--       // :9 Found IsInBounds (removed by trace)
			if count[r] > 0 {
				break
			}
		}
		if r == n {
			return flipsMax
		}
	}
	return 0
}

gopherbot · 2019-09-15T22:45:46Z

Change https://golang.org/cl/195220 mentions this issue: cmd/compile: detect indvars that are bound by other indvars

rasky · 2019-09-15T23:01:08Z

Thanks! I noticed that this loop:

			for i := 0; i < r; i++ {
				perm1[i] = perm1[i+1] // :10, :21 Found IsInBounds
			}

wasn't being detected by loopbce, and the fix was easy enough (just mailed). CL 195220 detects ias an induction variable, which in turns allow to remove one of the bound checks in the loop body (the one at :10). I had high hopes for the second one as well, but it's not removed. I tried with @zdjones trace CL series, but it doesn't seem to help as well. I dunno why, some investigation is required.

If we fix that, and manage to merge your trace CL series, I think we're basically done with Fannkuch, as tracking the contents of a slice is probably outside of what we can reasonably implement, and is possibly useful only for algorithms storing indexes to arrays within other arrays, which doesn't sound common enough to be worth design and resolving this problem.

zdjones · 2019-09-16T07:59:52Z

Awesome! I'm going to take a closer look at your CL, and that second bounds check, when I have the time. I suspect that CL will remove a decent number bounds checks from std/cmd as well?

I think we may be able to get this one as well:

for ; r != 1; r-- {
	count[r-1] = r // :9 Found IsInBounds
}

I'll investigate. My suspicion is that this is also a Phi issue. My plan is to try adding Phi to trace... maybe that will be enough.

I agree about the remaining 3 checks; k := perm1[0] should make the bounds checks depending on k pretty difficult to get at.

rasky · 2019-09-16T08:17:35Z

I suspect that CL will remove a decent number bounds checks from std/cmd as well?

I actually have not had time to check.

for ; r != 1; r-- {
count[r-1] = r // :9 Found IsInBounds
}

The problem with this one is that the current induction variable logic can't handle a loop whose bounding condition is != rather than >, and whose beginning condition is empty (so it depends on the flow-sensitive bound of r to prove that the higher bound is less than n).

If trace grows to be able to handle this case, then it probably can substitute the whole induction variable detection code.

rasky · 2019-09-17T11:55:16Z

CL 195220 detects ias an induction variable, which in turns allow to remove one of the bound checks in the loop body (the one at :10). I had high hopes for the second one as well, but it's not removed. I tried with @zdjones trace CL series, but it doesn't seem to help as well. I dunno why, some investigation is required.

So, the problem there is that, in prove, we don't update ft.limits when we record relations between dependencies not involving constants. For instance if learn that v < w, we should record that ft.limits[v.ID].max = ft.limits[w.ID].max-1. I have a patch that does that, and it manages to remove that second bound check in the loop body. Obviously, once you do that, you realize that if you learned something new about w later on, there's no way to reflect it again to v limits.

Eventually, we should get rid of ft.limits altogether and teach poset how to extract known min/max bounds from the recorded relations. This would simplify prove quite a bit and make it much more powerful, as poset has a global overview of all current relations.

gopherbot · 2019-09-22T13:54:15Z

Change https://golang.org/cl/196784 mentions this issue: cmd/compile: in prove, learn facts from OpSliceMake

Now that OpSliceMake is called by runtime.makeslice callers, prove can see and record the actual length and cap of each slice being constructed. This small patch is enough to remove 260 additional bound checks from cmd+std. Thanks to Martin Möhrmann for pointing me to CL141822 that I had missed. Updates #24660 Change-Id: I14556850f285392051f3f07d13b456b608b64eb9 Reviewed-on: https://go-review.googlesource.com/c/go/+/196784 Run-TryBot: Giovanni Bajo <[email protected]> TryBot-Result: Gobot Gobot <[email protected]> Reviewed-by: David Chase <[email protected]>

prove wasn't able to detect induction variables that was bound by another inducation variable. This happened because an indvar is a Phi, and thus in case of a dependency, the loop bounding condition looked as Phi < Phi. This triggered an existing codepath that checked whether the upper bound was a Phi to detect loop conditions written in reversed order respect to the idiomatic way (eg: for i:=0; len(n)>i; i++). To fix this, we call the indvar pattern matching on both operands of the loop condition, so that the first operand that matches will be treated as the indvar. Updates #24660 (removes a boundcheck from Fannkuch) Change-Id: Iade83d8deb54f14277ed3f2e37b190e1ed173d11 Reviewed-on: https://go-review.googlesource.com/c/go/+/195220 Reviewed-by: David Chase <[email protected]>

rasky · 2019-09-26T21:49:44Z

After the two CLs that just landed, I think we're only missing the check analyzed here among those that are remotely feasible.

I'm still experimenting with bound calculations in poset, but it's getting hairy. Not sure if I'll manage to land it, it might not be worth it, depending on how good it turns out to be when it's done. If I don't land it, I have another CL that improve ft.limits a little bit, enough to tackle this case and close this bug for good.

navytux mentioned this issue Apr 3, 2018

cmd/compile: no way to hint to eliminate bounds checks in bvec.AndNot #20393

Closed

rasky added Performance NeedsFix The path to resolution is known, but the work has not been done. labels Apr 3, 2018

bcmills added this to the Unplanned milestone Apr 3, 2018

navytux mentioned this issue May 3, 2018

cmd/compile: BCE/Prove do not take arithmetic into account #25197

Open

navytux mentioned this issue Aug 28, 2018

cmd/compile: Fannkuch benchmark performance regression #27108

Open

rasky mentioned this issue Jun 10, 2019

cmd/compile: prove pass unable to eliminate bounds check when a variable is assigned from len #32515

Open

gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jul 13, 2022

mknyszek added this to Go Compiler / Runtime Jul 13, 2022

mknyszek removed this from Go Compiler / Runtime Jul 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmd/compile: Fannkuch11 can be speed up by ~ 30% via improving BCE #24660

cmd/compile: Fannkuch11 can be speed up by ~ 30% via improving BCE #24660

navytux commented Apr 3, 2018 •

edited by rasky

Loading

rasky commented Apr 3, 2018

randall77 commented Apr 3, 2018

navytux commented Feb 14, 2019

gopherbot commented Sep 6, 2019

zdjones commented Sep 13, 2019

navytux commented Sep 13, 2019

rasky commented Sep 13, 2019

zdjones commented Sep 14, 2019 •

edited

Loading

gopherbot commented Sep 15, 2019

rasky commented Sep 15, 2019 •

edited

Loading

zdjones commented Sep 16, 2019

rasky commented Sep 16, 2019

rasky commented Sep 17, 2019

gopherbot commented Sep 22, 2019

rasky commented Sep 26, 2019

cmd/compile: Fannkuch11 can be speed up by ~ 30% via improving BCE #24660

cmd/compile: Fannkuch11 can be speed up by ~ 30% via improving BCE #24660

Comments

navytux commented Apr 3, 2018 • edited by rasky Loading

rasky commented Apr 3, 2018

randall77 commented Apr 3, 2018

navytux commented Feb 14, 2019

gopherbot commented Sep 6, 2019

zdjones commented Sep 13, 2019

navytux commented Sep 13, 2019

rasky commented Sep 13, 2019

zdjones commented Sep 14, 2019 • edited Loading

gopherbot commented Sep 15, 2019

rasky commented Sep 15, 2019 • edited Loading

zdjones commented Sep 16, 2019

rasky commented Sep 16, 2019

rasky commented Sep 17, 2019

gopherbot commented Sep 22, 2019

rasky commented Sep 26, 2019

navytux commented Apr 3, 2018 •

edited by rasky

Loading

zdjones commented Sep 14, 2019 •

edited

Loading

rasky commented Sep 15, 2019 •

edited

Loading