-
Notifications
You must be signed in to change notification settings - Fork 18k
cmd/internal/obj: optimize wrapper method prologue for branch prediction #19042
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
CL https://golang.org/cl/36893 mentions this issue. |
Hmm. #7491 contains a standalone test, and there is a corresponding test in test/recover.go. So I guess there is test coverage. I still don't fully understand what's going on, but I guess that's ok for the purposes of CL 36893, which is just rearranging deck chairs. |
…ologue Static branch prediction assumes that forward branches are not taken. The existing wrapper prologue almost always takes the first forward branch. Move the rare case to the end of the function. This CL is amd64 only. Other architectures will be done in separate CLs. Updates golang#19042. Package sort benchmarks: SearchWrappers-8 104ns ± 2% 104ns ± 0% -0.41% (p=0.006 n=30+41) SortString1K-8 128µs ± 1% 128µs ± 1% -0.25% (p=0.045 n=30+56) SortString1K_Slice-8 117µs ± 1% 117µs ± 1% ~ (p=0.855 n=30+59) StableString1K-8 18.6µs ± 1% 18.6µs ± 1% ~ (p=0.599 n=29+60) SortInt1K-8 61.0µs ± 1% 56.5µs ± 1% -7.36% (p=0.000 n=29+58) StableInt1K-8 74.6µs ± 1% 70.4µs ± 3% -5.54% (p=0.000 n=28+60) StableInt1K_Slice-8 59.9µs ± 1% 58.3µs ± 4% -2.64% (p=0.000 n=29+60) SortInt64K-8 6.02ms ± 2% 5.98ms ± 2% -0.60% (p=0.000 n=29+59) SortInt64K_Slice-8 5.07ms ± 2% 5.05ms ± 2% -0.38% (p=0.006 n=30+58) StableInt64K-8 6.41ms ± 1% 6.22ms ± 1% -3.00% (p=0.000 n=27+58) Sort1e2-8 37.4µs ± 1% 37.1µs ± 1% -0.91% (p=0.000 n=30+57) Stable1e2-8 74.8µs ± 1% 75.2µs ± 1% +0.52% (p=0.000 n=30+57) Sort1e4-8 8.11ms ± 1% 8.01ms ± 1% -1.20% (p=0.000 n=30+59) Stable1e4-8 24.3ms ± 1% 24.3ms ± 1% ~ (p=0.157 n=30+60) Sort1e6-8 1.25s ± 1% 1.23s ± 1% -1.43% (p=0.000 n=29+58) Stable1e6-8 4.93s ± 1% 4.90s ± 1% -0.56% (p=0.000 n=29+59) [Geo mean] 720µs 709µs -1.52% Assembly for sort.(*intPairs).Swap: Before: "".(*intPairs).Swap t=1 size=147 args=0x18 locals=0x8 0x0000 00000 (<autogenerated>:1) TEXT "".(*intPairs).Swap(SB), $8-24 0x0000 00000 (<autogenerated>:1) MOVQ (TLS), CX 0x0009 00009 (<autogenerated>:1) SUBQ $8, SP 0x000d 00013 (<autogenerated>:1) MOVQ BP, (SP) 0x0011 00017 (<autogenerated>:1) LEAQ (SP), BP 0x0015 00021 (<autogenerated>:1) MOVQ 32(CX), BX 0x0019 00025 (<autogenerated>:1) TESTQ BX, BX 0x001c 00028 (<autogenerated>:1) JEQ 43 0x001e 00030 (<autogenerated>:1) LEAQ 16(SP), DI 0x0023 00035 (<autogenerated>:1) CMPQ (BX), DI 0x0026 00038 (<autogenerated>:1) JNE 43 0x0028 00040 (<autogenerated>:1) MOVQ SP, (BX) 0x002b 00043 (<autogenerated>:1) NOP 0x002b 00043 (<autogenerated>:1) FUNCDATA $0, gclocals·e6397a44f8e1b6e77d0f200b4fba5269(SB) 0x002b 00043 (<autogenerated>:1) FUNCDATA $1, gclocals·69c1753bd5f81501d95132d08af04464(SB) 0x002b 00043 (<autogenerated>:1) MOVQ ""..this+16(FP), AX 0x0030 00048 (<autogenerated>:1) TESTQ AX, AX 0x0033 00051 (<autogenerated>:1) JEQ $0, 140 0x0035 00053 (<autogenerated>:1) MOVQ (AX), CX 0x0038 00056 (<autogenerated>:1) MOVQ 8(AX), AX 0x003c 00060 (<autogenerated>:1) MOVQ "".i+24(FP), DX 0x0041 00065 (<autogenerated>:1) CMPQ DX, AX 0x0044 00068 (<autogenerated>:1) JCC $0, 133 0x0046 00070 (<autogenerated>:1) SHLQ $4, DX 0x004a 00074 (<autogenerated>:1) MOVQ 8(CX)(DX*1), BX 0x004f 00079 (<autogenerated>:1) MOVQ (CX)(DX*1), SI 0x0053 00083 (<autogenerated>:1) MOVQ "".j+32(FP), DI 0x0058 00088 (<autogenerated>:1) CMPQ DI, AX 0x005b 00091 (<autogenerated>:1) JCC $0, 133 0x005d 00093 (<autogenerated>:1) SHLQ $4, DI 0x0061 00097 (<autogenerated>:1) MOVQ 8(CX)(DI*1), AX 0x0066 00102 (<autogenerated>:1) MOVQ (CX)(DI*1), R8 0x006a 00106 (<autogenerated>:1) MOVQ R8, (CX)(DX*1) 0x006e 00110 (<autogenerated>:1) MOVQ AX, 8(CX)(DX*1) 0x0073 00115 (<autogenerated>:1) MOVQ SI, (CX)(DI*1) 0x0077 00119 (<autogenerated>:1) MOVQ BX, 8(CX)(DI*1) 0x007c 00124 (<autogenerated>:1) MOVQ (SP), BP 0x0080 00128 (<autogenerated>:1) ADDQ $8, SP 0x0084 00132 (<autogenerated>:1) RET 0x0085 00133 (<autogenerated>:1) PCDATA $0, $1 0x0085 00133 (<autogenerated>:1) CALL runtime.panicindex(SB) 0x008a 00138 (<autogenerated>:1) UNDEF 0x008c 00140 (<autogenerated>:1) PCDATA $0, $1 0x008c 00140 (<autogenerated>:1) CALL runtime.panicwrap(SB) 0x0091 00145 (<autogenerated>:1) UNDEF After: "".(*intPairs).Swap t=1 size=149 args=0x18 locals=0x8 0x0000 00000 (<autogenerated>:1) TEXT "".(*intPairs).Swap(SB), $8-24 0x0000 00000 (<autogenerated>:1) MOVQ (TLS), CX 0x0009 00009 (<autogenerated>:1) SUBQ $8, SP 0x000d 00013 (<autogenerated>:1) MOVQ BP, (SP) 0x0011 00017 (<autogenerated>:1) LEAQ (SP), BP 0x0015 00021 (<autogenerated>:1) MOVQ 32(CX), BX 0x0019 00025 (<autogenerated>:1) TESTQ BX, BX 0x001c 00028 (<autogenerated>:1) JNE 134 0x001e 00030 (<autogenerated>:1) NOP 0x001e 00030 (<autogenerated>:1) FUNCDATA $0, gclocals·e6397a44f8e1b6e77d0f200b4fba5269(SB) 0x001e 00030 (<autogenerated>:1) FUNCDATA $1, gclocals·69c1753bd5f81501d95132d08af04464(SB) 0x001e 00030 (<autogenerated>:1) MOVQ ""..this+16(FP), AX 0x0023 00035 (<autogenerated>:1) TESTQ AX, AX 0x0026 00038 (<autogenerated>:1) JEQ $0, 127 0x0028 00040 (<autogenerated>:1) MOVQ (AX), CX 0x002b 00043 (<autogenerated>:1) MOVQ 8(AX), AX 0x002f 00047 (<autogenerated>:1) MOVQ "".i+24(FP), DX 0x0034 00052 (<autogenerated>:1) CMPQ DX, AX 0x0037 00055 (<autogenerated>:1) JCC $0, 120 0x0039 00057 (<autogenerated>:1) SHLQ $4, DX 0x003d 00061 (<autogenerated>:1) MOVQ 8(CX)(DX*1), BX 0x0042 00066 (<autogenerated>:1) MOVQ (CX)(DX*1), SI 0x0046 00070 (<autogenerated>:1) MOVQ "".j+32(FP), DI 0x004b 00075 (<autogenerated>:1) CMPQ DI, AX 0x004e 00078 (<autogenerated>:1) JCC $0, 120 0x0050 00080 (<autogenerated>:1) SHLQ $4, DI 0x0054 00084 (<autogenerated>:1) MOVQ 8(CX)(DI*1), AX 0x0059 00089 (<autogenerated>:1) MOVQ (CX)(DI*1), R8 0x005d 00093 (<autogenerated>:1) MOVQ R8, (CX)(DX*1) 0x0061 00097 (<autogenerated>:1) MOVQ AX, 8(CX)(DX*1) 0x0066 00102 (<autogenerated>:1) MOVQ SI, (CX)(DI*1) 0x006a 00106 (<autogenerated>:1) MOVQ BX, 8(CX)(DI*1) 0x006f 00111 (<autogenerated>:1) MOVQ (SP), BP 0x0073 00115 (<autogenerated>:1) ADDQ $8, SP 0x0077 00119 (<autogenerated>:1) RET 0x0078 00120 (<autogenerated>:1) PCDATA $0, $1 0x0078 00120 (<autogenerated>:1) CALL runtime.panicindex(SB) 0x007d 00125 (<autogenerated>:1) UNDEF 0x007f 00127 (<autogenerated>:1) PCDATA $0, $1 0x007f 00127 (<autogenerated>:1) CALL runtime.panicwrap(SB) 0x0084 00132 (<autogenerated>:1) UNDEF 0x0086 00134 (<autogenerated>:1) LEAQ 16(SP), DI 0x008b 00139 (<autogenerated>:1) CMPQ (BX), DI 0x008e 00142 (<autogenerated>:1) JNE 30 0x0090 00144 (<autogenerated>:1) MOVQ SP, (BX) 0x0093 00147 (<autogenerated>:1) JMP 30 Change-Id: Ie8c37f384bba10fbacaa754bb0a6b0a7e520ef01
I'm not sure I understand what is going on here either. The argp involved is used by recover() to determine which panic is being recovered. I guess when there's a wrapper involved the argp needs to be moved from the args of the *T function to the args of the T function, so that if the T function recovers it can squash the associated panic. So maybe this code triggers it?
|
…ologue Static branch prediction assumes that forward branches are not taken. The existing wrapper prologue almost always takes the first forward branch. Move the rare case to the end of the function. This CL is amd64 only. Other architectures will be done in separate CLs. Updates #19042. Package sort benchmarks: SearchWrappers-8 104ns ± 2% 104ns ± 0% -0.41% (p=0.006 n=30+41) SortString1K-8 128µs ± 1% 128µs ± 1% -0.25% (p=0.045 n=30+56) SortString1K_Slice-8 117µs ± 1% 117µs ± 1% ~ (p=0.855 n=30+59) StableString1K-8 18.6µs ± 1% 18.6µs ± 1% ~ (p=0.599 n=29+60) SortInt1K-8 61.0µs ± 1% 56.5µs ± 1% -7.36% (p=0.000 n=29+58) StableInt1K-8 74.6µs ± 1% 70.4µs ± 3% -5.54% (p=0.000 n=28+60) StableInt1K_Slice-8 59.9µs ± 1% 58.3µs ± 4% -2.64% (p=0.000 n=29+60) SortInt64K-8 6.02ms ± 2% 5.98ms ± 2% -0.60% (p=0.000 n=29+59) SortInt64K_Slice-8 5.07ms ± 2% 5.05ms ± 2% -0.38% (p=0.006 n=30+58) StableInt64K-8 6.41ms ± 1% 6.22ms ± 1% -3.00% (p=0.000 n=27+58) Sort1e2-8 37.4µs ± 1% 37.1µs ± 1% -0.91% (p=0.000 n=30+57) Stable1e2-8 74.8µs ± 1% 75.2µs ± 1% +0.52% (p=0.000 n=30+57) Sort1e4-8 8.11ms ± 1% 8.01ms ± 1% -1.20% (p=0.000 n=30+59) Stable1e4-8 24.3ms ± 1% 24.3ms ± 1% ~ (p=0.157 n=30+60) Sort1e6-8 1.25s ± 1% 1.23s ± 1% -1.43% (p=0.000 n=29+58) Stable1e6-8 4.93s ± 1% 4.90s ± 1% -0.56% (p=0.000 n=29+59) [Geo mean] 720µs 709µs -1.52% Assembly for sort.(*intPairs).Swap: Before: "".(*intPairs).Swap t=1 size=147 args=0x18 locals=0x8 0x0000 00000 (<autogenerated>:1) TEXT "".(*intPairs).Swap(SB), $8-24 0x0000 00000 (<autogenerated>:1) MOVQ (TLS), CX 0x0009 00009 (<autogenerated>:1) SUBQ $8, SP 0x000d 00013 (<autogenerated>:1) MOVQ BP, (SP) 0x0011 00017 (<autogenerated>:1) LEAQ (SP), BP 0x0015 00021 (<autogenerated>:1) MOVQ 32(CX), BX 0x0019 00025 (<autogenerated>:1) TESTQ BX, BX 0x001c 00028 (<autogenerated>:1) JEQ 43 0x001e 00030 (<autogenerated>:1) LEAQ 16(SP), DI 0x0023 00035 (<autogenerated>:1) CMPQ (BX), DI 0x0026 00038 (<autogenerated>:1) JNE 43 0x0028 00040 (<autogenerated>:1) MOVQ SP, (BX) 0x002b 00043 (<autogenerated>:1) NOP 0x002b 00043 (<autogenerated>:1) FUNCDATA $0, gclocals·e6397a44f8e1b6e77d0f200b4fba5269(SB) 0x002b 00043 (<autogenerated>:1) FUNCDATA $1, gclocals·69c1753bd5f81501d95132d08af04464(SB) 0x002b 00043 (<autogenerated>:1) MOVQ ""..this+16(FP), AX 0x0030 00048 (<autogenerated>:1) TESTQ AX, AX 0x0033 00051 (<autogenerated>:1) JEQ $0, 140 0x0035 00053 (<autogenerated>:1) MOVQ (AX), CX 0x0038 00056 (<autogenerated>:1) MOVQ 8(AX), AX 0x003c 00060 (<autogenerated>:1) MOVQ "".i+24(FP), DX 0x0041 00065 (<autogenerated>:1) CMPQ DX, AX 0x0044 00068 (<autogenerated>:1) JCC $0, 133 0x0046 00070 (<autogenerated>:1) SHLQ $4, DX 0x004a 00074 (<autogenerated>:1) MOVQ 8(CX)(DX*1), BX 0x004f 00079 (<autogenerated>:1) MOVQ (CX)(DX*1), SI 0x0053 00083 (<autogenerated>:1) MOVQ "".j+32(FP), DI 0x0058 00088 (<autogenerated>:1) CMPQ DI, AX 0x005b 00091 (<autogenerated>:1) JCC $0, 133 0x005d 00093 (<autogenerated>:1) SHLQ $4, DI 0x0061 00097 (<autogenerated>:1) MOVQ 8(CX)(DI*1), AX 0x0066 00102 (<autogenerated>:1) MOVQ (CX)(DI*1), R8 0x006a 00106 (<autogenerated>:1) MOVQ R8, (CX)(DX*1) 0x006e 00110 (<autogenerated>:1) MOVQ AX, 8(CX)(DX*1) 0x0073 00115 (<autogenerated>:1) MOVQ SI, (CX)(DI*1) 0x0077 00119 (<autogenerated>:1) MOVQ BX, 8(CX)(DI*1) 0x007c 00124 (<autogenerated>:1) MOVQ (SP), BP 0x0080 00128 (<autogenerated>:1) ADDQ $8, SP 0x0084 00132 (<autogenerated>:1) RET 0x0085 00133 (<autogenerated>:1) PCDATA $0, $1 0x0085 00133 (<autogenerated>:1) CALL runtime.panicindex(SB) 0x008a 00138 (<autogenerated>:1) UNDEF 0x008c 00140 (<autogenerated>:1) PCDATA $0, $1 0x008c 00140 (<autogenerated>:1) CALL runtime.panicwrap(SB) 0x0091 00145 (<autogenerated>:1) UNDEF After: "".(*intPairs).Swap t=1 size=149 args=0x18 locals=0x8 0x0000 00000 (<autogenerated>:1) TEXT "".(*intPairs).Swap(SB), $8-24 0x0000 00000 (<autogenerated>:1) MOVQ (TLS), CX 0x0009 00009 (<autogenerated>:1) SUBQ $8, SP 0x000d 00013 (<autogenerated>:1) MOVQ BP, (SP) 0x0011 00017 (<autogenerated>:1) LEAQ (SP), BP 0x0015 00021 (<autogenerated>:1) MOVQ 32(CX), BX 0x0019 00025 (<autogenerated>:1) TESTQ BX, BX 0x001c 00028 (<autogenerated>:1) JNE 134 0x001e 00030 (<autogenerated>:1) NOP 0x001e 00030 (<autogenerated>:1) FUNCDATA $0, gclocals·e6397a44f8e1b6e77d0f200b4fba5269(SB) 0x001e 00030 (<autogenerated>:1) FUNCDATA $1, gclocals·69c1753bd5f81501d95132d08af04464(SB) 0x001e 00030 (<autogenerated>:1) MOVQ ""..this+16(FP), AX 0x0023 00035 (<autogenerated>:1) TESTQ AX, AX 0x0026 00038 (<autogenerated>:1) JEQ $0, 127 0x0028 00040 (<autogenerated>:1) MOVQ (AX), CX 0x002b 00043 (<autogenerated>:1) MOVQ 8(AX), AX 0x002f 00047 (<autogenerated>:1) MOVQ "".i+24(FP), DX 0x0034 00052 (<autogenerated>:1) CMPQ DX, AX 0x0037 00055 (<autogenerated>:1) JCC $0, 120 0x0039 00057 (<autogenerated>:1) SHLQ $4, DX 0x003d 00061 (<autogenerated>:1) MOVQ 8(CX)(DX*1), BX 0x0042 00066 (<autogenerated>:1) MOVQ (CX)(DX*1), SI 0x0046 00070 (<autogenerated>:1) MOVQ "".j+32(FP), DI 0x004b 00075 (<autogenerated>:1) CMPQ DI, AX 0x004e 00078 (<autogenerated>:1) JCC $0, 120 0x0050 00080 (<autogenerated>:1) SHLQ $4, DI 0x0054 00084 (<autogenerated>:1) MOVQ 8(CX)(DI*1), AX 0x0059 00089 (<autogenerated>:1) MOVQ (CX)(DI*1), R8 0x005d 00093 (<autogenerated>:1) MOVQ R8, (CX)(DX*1) 0x0061 00097 (<autogenerated>:1) MOVQ AX, 8(CX)(DX*1) 0x0066 00102 (<autogenerated>:1) MOVQ SI, (CX)(DI*1) 0x006a 00106 (<autogenerated>:1) MOVQ BX, 8(CX)(DI*1) 0x006f 00111 (<autogenerated>:1) MOVQ (SP), BP 0x0073 00115 (<autogenerated>:1) ADDQ $8, SP 0x0077 00119 (<autogenerated>:1) RET 0x0078 00120 (<autogenerated>:1) PCDATA $0, $1 0x0078 00120 (<autogenerated>:1) CALL runtime.panicindex(SB) 0x007d 00125 (<autogenerated>:1) UNDEF 0x007f 00127 (<autogenerated>:1) PCDATA $0, $1 0x007f 00127 (<autogenerated>:1) CALL runtime.panicwrap(SB) 0x0084 00132 (<autogenerated>:1) UNDEF 0x0086 00134 (<autogenerated>:1) LEAQ 16(SP), DI 0x008b 00139 (<autogenerated>:1) CMPQ (BX), DI 0x008e 00142 (<autogenerated>:1) JNE 30 0x0090 00144 (<autogenerated>:1) MOVQ SP, (BX) 0x0093 00147 (<autogenerated>:1) JMP 30 Change-Id: Ie8c37f384bba10fbacaa754bb0a6b0a7e520ef01 Reviewed-on: https://go-review.googlesource.com/36893 Reviewed-by: Keith Randall <[email protected]>
CL https://golang.org/cl/37611 mentions this issue. |
…ologue This is a follow-up to CL 36893. Move the unlikely branch in the wrapper prologue to the end of the function, where it has minimal impact on the instruction cache. Static branch prediction is also less likely to choose a forward branch. Updates #19042 sort benchmarks: name old time/op new time/op delta SearchWrappers-4 1.44µs ± 0% 1.45µs ± 0% +1.15% (p=0.000 n=9+10) SortString1K-4 1.02ms ± 0% 1.04ms ± 0% +2.39% (p=0.000 n=10+10) SortString1K_Slice-4 960µs ± 0% 989µs ± 0% +2.95% (p=0.000 n=9+10) StableString1K-4 218µs ± 0% 213µs ± 0% -2.13% (p=0.000 n=10+10) SortInt1K-4 541µs ± 0% 543µs ± 0% +0.30% (p=0.003 n=9+10) StableInt1K-4 760µs ± 1% 763µs ± 1% +0.38% (p=0.011 n=10+10) StableInt1K_Slice-4 840µs ± 1% 779µs ± 0% -7.31% (p=0.000 n=9+10) SortInt64K-4 55.2ms ± 0% 55.4ms ± 1% +0.34% (p=0.012 n=10+8) SortInt64K_Slice-4 56.2ms ± 0% 55.6ms ± 1% -1.16% (p=0.000 n=10+10) StableInt64K-4 70.9ms ± 1% 71.0ms ± 0% ~ (p=0.315 n=10+7) Sort1e2-4 250µs ± 0% 249µs ± 1% ~ (p=0.315 n=9+10) Stable1e2-4 600µs ± 0% 594µs ± 0% -1.09% (p=0.000 n=9+10) Sort1e4-4 51.2ms ± 0% 51.4ms ± 1% +0.40% (p=0.001 n=9+10) Stable1e4-4 204ms ± 1% 199ms ± 1% -2.27% (p=0.000 n=10+10) Sort1e6-4 8.42s ± 0% 8.44s ± 0% +0.28% (p=0.000 n=8+9) Stable1e6-4 43.3s ± 0% 42.5s ± 1% -1.89% (p=0.000 n=9+9) Change-Id: I827559aa557fdba211a38ce3f77137b471c5c67e Reviewed-on: https://go-review.googlesource.com/37611 Run-TryBot: Josh Bleecher Snyder <[email protected]> Reviewed-by: Josh Bleecher Snyder <[email protected]>
New tasks include: golang/go#19675 cmd/vet: report uses of -0 in float32/64 context golang/go#19683 cmd/compile: eliminate usages of global lineno golang/go#19670 x/tools/go/ssa: make opaqueType less annoying to use golang/go#19636 encoding/base64: decoding is slow golang/go#23471 x/perf/cmd/benchstat: tips or quickstart for newcomers golang/go#19577 test: errorcheck support for intraline errors golang/go#19490 cmd/vet: reduce the amount of false positives for -shadow mode golang/go#19042 cmd/internal/obj: optimize wrapper method prologue for branch prediction golang/go#19013 cmd/compile: add tool for understanding/debugging SSA rules
Wrapper functions (such as those found in CL 36809) have an extra prologue inserted during preprocessing. Quoting cmd/internal/obj/xk86/obj6.go (after CL 36833), what this does is:
These jumps are the wrong direction. Static branch prediction says forward jumps are not taken, but the first forward jump is almost always taken. Rearrange the code to fix that.
I have a CL, to be mailed momentarily, that does this. However, I am having a hard time writing a test for this, because I cannot find a way to trigger the rare (in-a-panic) cases. From reading the panic sources, this only happens when panicking inside a deferred statement (?), but even then I see the wrapper function in the traceback.
What exactly is this prologue for? How do I test that it is doing its job correctly?
cc @randall77 @minux
The text was updated successfully, but these errors were encountered: