cmd/compile: excessively slow compile for 60K lines function #15112
Comments
Seems to work at tip, albeit incredibly slowly:
It's a huge file, but for reference, gotype type-checks this in 0.9s. There's some algorithmic issue here.
@Khaaan can you confirm that this code (or whatever original you derived it from) compiles, if extremely slowly, with the latest compiler at tip?
@griesemer It compiles for me at tip (f229e46):
$ time go tool compile buildme.go
real 1m55.632s
Thanks. There's probably some non-linearity in the backend somewhere. Adjusting title.
Looks like it's spending most of the time in CSE. (Surprise!) cc @tzneal
Started.
FYI, I may be mucking around with Types, specifically canonicalizing them down to a 32-bit integer, as part of improvements to register allocation. Not sure whether this will matter for your work here, but just a heads-up.
CL https://golang.org/cl/21937 mentions this issue.
runtime.newobject never returns the same thing twice, so the resulting value will never be a common subexpression. This helps when compiling large static data structures that include pointers, such as maps and slices. No clear performance impact on other code. (See below.)

For the code in issue #15112:

Before: real 1m14.238s  user 1m18.985s  sys 0m0.787s
After:  real 0m47.172s  user 0m52.248s  sys 0m0.767s

For the code in issue #15235, size 10k:

Before: real 0m44.916s  user 0m46.577s  sys 0m0.304s
After:  real 0m7.703s   user 0m9.041s   sys 0m0.316s

Still more work to be done, particularly for #15112.

Updates #15112
Updates #15235

name       old time/op     new time/op     delta
Template   330ms ±11%      333ms ±13%      ~ (p=0.749 n=20+19)
Unicode    148ms ± 6%      152ms ± 8%      ~ (p=0.072 n=18+20)
GoTypes    1.01s ± 7%      1.01s ± 3%      ~ (p=0.583 n=20+20)
Compiler   5.04s ± 2%      5.06s ± 2%      ~ (p=0.314 n=20+20)

name       old user-ns/op      new user-ns/op      delta
Template   444user-ms ±11%     445user-ms ±10%     ~ (p=0.738 n=20+20)
Unicode    215user-ms ± 5%     218user-ms ± 5%     ~ (p=0.239 n=18+18)
GoTypes    1.45user-s ± 3%     1.45user-s ± 4%     ~ (p=0.620 n=20+20)
Compiler   7.23user-s ± 2%     7.22user-s ± 2%     ~ (p=0.901 n=20+19)

name       old alloc/op    new alloc/op    delta
Template   55.0MB ± 0%     55.0MB ± 0%     ~ (p=0.547 n=20+20)
Unicode    37.6MB ± 0%     37.6MB ± 0%     ~ (p=0.301 n=20+20)
GoTypes    177MB ± 0%      177MB ± 0%      ~ (p=0.065 n=20+19)
Compiler   798MB ± 0%      797MB ± 0%      -0.05% (p=0.000 n=19+20)

name       old allocs/op   new allocs/op   delta
Template   492k ± 0%       493k ± 0%       +0.03% (p=0.030 n=20+20)
Unicode    377k ± 0%       377k ± 0%       ~ (p=0.423 n=20+19)
GoTypes    1.40M ± 0%      1.40M ± 0%      ~ (p=0.102 n=20+20)
Compiler   5.53M ± 0%      5.53M ± 0%      ~ (p=0.094 n=17+18)

name       old text-bytes  new text-bytes  delta
HelloSize  561k ± 0%       561k ± 0%       ~ (all samples are equal)
CmdGoSize  6.13M ± 0%      6.13M ± 0%      ~ (all samples are equal)

name       old data-bytes  new data-bytes  delta
HelloSize  128k ± 0%       128k ± 0%       ~ (all samples are equal)
CmdGoSize  306k ± 0%       306k ± 0%       ~ (all samples are equal)

name       old exe-bytes   new exe-bytes   delta
HelloSize  905k ± 0%       905k ± 0%       ~ (all samples are equal)
CmdGoSize  9.64M ± 0%      9.64M ± 0%      ~ (all samples are equal)

Change-Id: Id774e2901d7701a3ec7a1c1d1cf1d9327a4107fc
Reviewed-on: https://go-review.googlesource.com/21937
Run-TryBot: Josh Bleecher Snyder <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Todd Neal <[email protected]>
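To make the idea in that CL concrete, here is a minimal sketch of how a CSE-style partitioner can exempt allocation-like ops. It is an illustrative toy with invented names, not the compiler's actual ssa code:

```go
package main

import "fmt"

// A tiny, illustrative model of CSE partitioning (not the compiler's
// real ssa.Value/cse code): values are grouped by an equality key, but
// ops that never return the same result twice -- such as a call to
// runtime.newobject -- get a fresh singleton class so they can never be
// treated as common subexpressions.
type value struct {
	id     int
	op     string
	args   [2]int // ids of argument values (0 = unused)
	unique bool   // true for allocation-like ops
}

func partition(vals []value) map[string][]int {
	classes := make(map[string][]int)
	for _, v := range vals {
		key := fmt.Sprintf("%s %v", v.op, v.args)
		if v.unique {
			// Never merge: key on the value's own id.
			key = fmt.Sprintf("unique %d", v.id)
		}
		classes[key] = append(classes[key], v.id)
	}
	return classes
}

func main() {
	vals := []value{
		{id: 1, op: "const 8"},
		{id: 2, op: "const 8"}, // CSE candidate with id 1
		{id: 3, op: "newobject", args: [2]int{1}, unique: true},
		{id: 4, op: "newobject", args: [2]int{2}, unique: true}, // must not merge with id 3
	}
	fmt.Println(partition(vals))
}
```

The payoff described in the commit message follows from this exemption: large static data structures generate many allocation calls, and excluding them up front shrinks the equivalence classes the rest of the pass has to process.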
CL 21961 will help a bit more (if it goes in). The bottleneck now is the O(n^2) invocation of isAncestorEq. I don't see any clear algorithmic improvements there, but perhaps @dr2chase or @tzneal does. The only further thing I can think of to try is to make isAncestorEq itself cheaper. That's not going to be easy. Perhaps we could cache the entry and exit numbers per value in the eqclass to avoid the indirect lookups in the sparse tree? But that would cost allocations, and it's not fixing the root cause. I don't plan to work on this further. It is considerably faster now; I hope it's improved enough that we can downgrade from Go1.7 to Unplanned.
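For readers unfamiliar with the sparse-tree trick referenced here: an isAncestorEq query can be answered by comparing DFS entry/exit numbers on the dominator tree, so each individual query is cheap even when the number of queries is quadratic. A minimal sketch of that numbering, with illustrative types rather than the actual ssa sparse-tree API:

```go
package main

import "fmt"

// Number each node with the times a DFS enters and leaves it; then
// "a dominates b (or a == b)" in the dominator tree becomes two integer
// comparisons, because DFS intervals either nest or are disjoint.
type node struct {
	children    []*node
	entry, exit int
}

func number(n *node, clock *int) {
	*clock++
	n.entry = *clock
	for _, c := range n.children {
		number(c, clock)
	}
	*clock++
	n.exit = *clock
}

// isAncestorEq reports whether a is an ancestor of b, or a == b.
func isAncestorEq(a, b *node) bool {
	return a.entry <= b.entry && b.exit <= a.exit
}

func main() {
	leaf1, leaf2 := &node{}, &node{}
	mid := &node{children: []*node{leaf1}}
	root := &node{children: []*node{mid, leaf2}}
	clock := 0
	number(root, &clock)

	fmt.Println(isAncestorEq(root, leaf1)) // true
	fmt.Println(isAncestorEq(mid, leaf2))  // false: siblings' subtrees don't nest
}
```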
CL https://golang.org/cl/21961 mentions this issue.
I'm working on another CL that gives a 50% speedup for this issue. It just sorts each partition by isAncestorEq to remove the need to find the maximal dominator on each pass through. The other thing I was trying was to reduce the number of values created in the first place. Every reference to a PEXTERN ONAME ("moves" in this case) generates another OpAddr value, all of which then get removed by cse. The other large sets of duplicate values for this case are stack addresses for calling runtime.newobject.
Did you see CL 21961? I also tried an initial sort. The primary problem is that isAncestor(Eq) provides only a partial ordering, so you fundamentally need to look at all pairs to get a complete sort. And then you need to maintain the sort order as you remove elements, which is itself expensive. That's how I ended up at CL 21961. But I'd be delighted if you see something I didn't. :) Reducing the number of values would be a big win. If you can reduce the number of Nodes, that'd help everyone, since compile time in general is at this point roughly proportional to the number of Nodes for non-pathological files. I wonder whether stack addresses (or maybe a subset of them) should be put in the constant cache.
I saw it, but thought that the sort by dom might be cleaner. I was going to flail around a bit with it anyway :) I saw the partial-ordering issue after I started coding and then began using the entry value from sdom to provide an absolute ordering. I just finished up maintaining the sort order to see if it's worth it. I've got a DNR CL 21981 if you want to take a look. I'm on a slowish MacBook, so my benchmark times are a little worse in absolute terms, but it's still a ~30% improvement.
CL https://golang.org/cl/21981 mentions this issue.
We do two O(n) scans of all values in an eqclass when computing substitutions for CSE. In unfortunate cases, like those found in #15112, we can have a large eqclass composed of values found in blocks none of which dominates the others. This leads to O(n^2) behavior. The elements are removed one at a time, with O(n) scans each time. This CL removes the linear scan by sorting the eqclass so that dominant values are sorted first. As long as we also ensure we don't disturb the sort order, we no longer need to scan for the maximally dominant value.

For the code in issue #15112:

Before: real 1m26.094s  user 1m30.776s  sys 0m1.125s
After:  real 0m52.099s  user 0m56.829s  sys 0m1.092s

Updates #15112

Change-Id: Ic4f8680ed172e716232436d31963209c146ef850
Reviewed-on: https://go-review.googlesource.com/21981
Reviewed-by: Josh Bleecher Snyder <[email protected]>
Run-TryBot: Todd Neal <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
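A rough sketch of the sort described in that CL, with hypothetical types (the real code in the compiler's ssa package differs in detail): give each value the DFS entry number of its block in the dominator tree and sort the class on it, so any value that dominates others in the class appears before them:

```go
package main

import (
	"fmt"
	"sort"
)

// Because a dominator block is always entered before the blocks it
// dominates, sorting by entry number puts a maximal dominator at the
// front of the class -- no per-round O(n) scan is needed to find it.
type val struct {
	id    int
	entry int // dominator-tree DFS entry number of the value's block
}

func sortByDom(class []val) {
	sort.Slice(class, func(i, j int) bool { return class[i].entry < class[j].entry })
}

func main() {
	class := []val{{id: 7, entry: 9}, {id: 3, entry: 2}, {id: 5, entry: 4}}
	sortByDom(class)
	fmt.Println(class[0].id, "is a candidate dominant value") // 3
}
```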
Process a slice of equivalent values by setting replaced values to nil instead of removing them from the slice, to eliminate copying. Also take advantage of the entry number sort to break early once we reach a value in a block that is not dominated.

For the code in issue #15112:

Before: real 0m52.603s  user 0m56.957s  sys 0m1.213s
After:  real 0m22.048s  user 0m26.445s  sys 0m0.939s

Updates #15112

Change-Id: I06d9e1e1f1ad85d7fa196c5d51f0dc163907376d
Reviewed-on: https://go-review.googlesource.com/22068
Reviewed-by: David Chase <[email protected]>
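Putting the two CLs together, the substitution loop conceptually looks something like the sketch below. This is a simplified model with invented types, not the compiler's actual cse.go: values are pre-sorted by entry number, replaced entries are nilled in place, and the inner loop stops at the first value the current leader does not dominate.

```go
package main

import "fmt"

// The class is already sorted by dominator-tree entry number. The first
// non-nil value v dominates a contiguous prefix of what follows; replaced
// values are set to nil in place (no slice copying), and the inner loop
// breaks as soon as it meets a value v does not dominate, since under
// this order nothing later can be dominated by v either.
type val struct {
	id          int
	entry, exit int // sparse-tree numbering of the value's block
}

func dominatesEq(a, b *val) bool { return a.entry <= b.entry && b.exit <= a.exit }

// substitute records, in rewrite, which value replaces each eliminated one.
func substitute(class []*val, rewrite map[int]int) {
	for i := 0; i < len(class); i++ {
		v := class[i]
		if v == nil {
			continue
		}
		for j := i + 1; j < len(class); j++ {
			w := class[j]
			if w == nil {
				continue
			}
			if !dominatesEq(v, w) {
				break // sorted by entry: nothing further is dominated by v
			}
			rewrite[w.id] = v.id
			class[j] = nil // mark replaced instead of deleting from the slice
		}
	}
}

func main() {
	// Entry/exit numbers describe a chain: value 1's block dominates 2's, which dominates 3's.
	class := []*val{{1, 1, 8}, {2, 2, 7}, {3, 3, 6}}
	rewrite := map[int]int{}
	substitute(class, rewrite)
	fmt.Println(rewrite) // map[2:1 3:1]
}
```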
@josharian - Caching stack addresses is harder than I thought, because the value's type is in most cases a unique instance of a Type. I started looking at caching types in Ptrto() to solve this, but the Node's type is unique there as well. The easy cases are args to runtime functions, which are all given type Types[TUINTPTR]. Past that, I don't see a great way to reduce the OpOffPtr values.
I've got a type interner working (based on Compare) that might help here. So far I am only using it for LocalSlot, because those are sometimes too fat. Will see about making it more pervasive.
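For illustration only, here is a hedged sketch of what a type interner along these lines might look like; it is not the compiler's actual implementation, and the string key stands in for a real structural comparison such as Compare:

```go
package main

import "fmt"

// Types are canonicalized through a map keyed by a stable textual
// description, so structurally equal types share one *typ instance and
// later comparisons reduce to pointer equality.
type typ struct {
	desc string // e.g. "*[]int"; stands in for a real structural key
}

type interner struct {
	seen map[string]*typ
}

func newInterner() *interner { return &interner{seen: make(map[string]*typ)} }

func (in *interner) intern(t *typ) *typ {
	if canon, ok := in.seen[t.desc]; ok {
		return canon
	}
	in.seen[t.desc] = t
	return t
}

func main() {
	in := newInterner()
	a := in.intern(&typ{desc: "*[]int"})
	b := in.intern(&typ{desc: "*[]int"})
	fmt.Println(a == b) // true: one canonical instance
}
```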
A quick profile suggests that the only unusual performance bit here is, in fact, type comparison, and even that's only ~13% of compile time. So: vast strides.
I'm going to move this to 1.8; I don't think we have anything else to help here for 1.7.
@tzneal #15520 is another case in which we do a bunch of work to collapse OpSP offsets. In that case, they all have the same type, though they don't compare the same using ==. A cache of these keyed by (off, type.String) would help here, at least. (Similar to the const cache, callers would need to make sure that their type is actually CMPeq, not just string-equal.) I think we should look again at this idea for Go 1.8.
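A hedged sketch of the cache suggested in that comment, with illustrative names rather than the compiler's API: offset-pointer values are keyed by (offset, type string), and because distinct types can share a string form, a hit is only reused after a stricter equality check:

```go
package main

import "fmt"

// OpOffPtr-like values cached by (offset, type string). The equal method
// below is a stand-in for a real CMPeq-style structural comparison.
type typ struct {
	str string
	// ... structural fields omitted
}

func (t *typ) String() string    { return t.str }
func (t *typ) equal(u *typ) bool { return t == u }

type offPtr struct {
	off int64
	typ *typ
}

type key struct {
	off     int64
	typeStr string
}

type cache map[key]*offPtr

func (c cache) lookup(off int64, t *typ) *offPtr {
	k := key{off, t.String()}
	if v, ok := c[k]; ok && v.typ.equal(t) {
		return v // safe to reuse: not merely string-equal
	}
	v := &offPtr{off: off, typ: t}
	c[k] = v
	return v
}

func main() {
	c := cache{}
	tp := &typ{str: "*uint8"}
	a := c.lookup(8, tp)
	b := c.lookup(8, tp)
	fmt.Println(a == b) // true: second lookup reuses the cached value
}
```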
CL https://golang.org/cl/30354 mentions this issue.
The Go 1.6 compiler panics on this 2.25 MB source file containing a slice with 13240 elements: https://drive.google.com/file/d/0B2AF520HhpuzdzZhSVpPZGZkTUU/view
go1.6 linux/amd64
GOARCH="amd64"
GOOS="linux"
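The linked file is not reproduced in the thread. As a rough, invented stand-in for the shape of such generated code (names and fields are hypothetical), the stress case is a single package-level composite literal or function with tens of thousands of elements, each containing pointers (slices, maps), which is exactly what exercises the CSE behavior discussed above:

```go
package main

// A tiny illustrative stand-in, not the actual linked file.
type move struct {
	from, to int
	tags     []string
}

var moves = []move{
	{from: 0, to: 1, tags: []string{"a"}},
	{from: 1, to: 2, tags: []string{"b"}},
	// ... repeated roughly 13,000 more times in the real file ...
}

func main() {}
```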