Loop array variable optimization #1054
Comments
The pass which does this in LLVM is LoopStrengthReduce, which is indeed not run in the fastcomp backend but is run in the wasm backend. LLVM appears to prefer two induction variables in this case because it eliminates the per-iteration address computation. If you want to experiment with disabling this pass, you can add -disable-lsr to the llc command.
Thanks, it does look like that's a factor here. Disabling that pass makes fannkuch 7.5% faster and 1.2% smaller.
Where is it 7.5% faster?
I tested SpiderMonkey and v8. Which I guess raises the question: do any wasm VMs do that type of optimization?
I guess if it's a lossless size win then it should be done, but if it's not lossless or not a size win, then skewing for what a subset of VMs do at a specific point in time seems like the wrong approach. In those cases we should fix the VMs.
Doing more thorough testing, it looks like disabling lsr removes most of the speed issues with the wasm backend. Without that pass, it's almost always within ±10% of asm2wasm (one exception is lzma, where it is 1.3x slower on sm and 2.5x slower on v8). Doing so also helps code size, but it's a very small factor there: still e.g. 7.5% larger on box2d (or 3.5% larger if we binaryen -O it). So I'd recommend we disable that pass. If that makes sense, would the place to do it be in LLVM, or should we pass the flag to llc from emscripten?
Followup PR: emscripten-core/emscripten#7386 |
I did another round of tests now to follow up on WebAssembly/binaryen#1054 , comparing v8 and sm, on the wasm backend with and without lsr. Disabling lsr reduces code size in all the benchmark suite, up to 1% (but usually less). It's also somewhat faster (5-10%) in a few microbenchmarks, on both VMs (so this is not a VM-specific issue). So seems worthwhile. LLVM bug: https://bugs.llvm.org/show_bug.cgi?id=39488
I investigated this again now. Enabling LSR in the wasm backend increases code size by 1.5% on the emscripten benchmark suite, while the speed results are mixed - some things get faster, some get slower. Since roughly equal amounts change in either direction, and this regresses size, I'm not sure it's worth changing. Example results: copy is 10% slower, fasta is 11% slower, memops is 33% faster, coremark is 5% faster.
copy and memops are super micro-benchmarky though, right? coremark is bigger, and fasta is kind of in the middle IIRC. IMO we should weight perf gains in the more "real" ones more heavily.
oh also IIRC the V8 folks were experimenting with loop unrolling, though I don't remember if they turned it on or not. If they have that kind of infrastructure, maybe it's worth trying out something like this in the VM too. |
They've turned it on, but it has some suboptimal heuristics.
Other random ideas: if this does turn out to trade size for speed, we could consider turning it on just at -O3. |
Coremark is more interesting than the microbenchmarks, fair point. On the non-microbenchmarks, Coremark and Linpack benefit while Zlib regresses. (While microbenchmarks matter less, that the results there seem to randomly help or hurt is pretty odd...) The VM might be able to do better here - at least it would avoid the download size regression. And maybe using the actual register number it could figure out whether it makes sense to do this or not.
It seems that this is more of an Emscripten or LLVM issue, so I'll go ahead and close it. |
Investigating differences between asm2wasm and wasm backend output, one cause of the latter's larger code sizes is that it emits loops over an array differently. For example, in a loop writing
array[x] = x
over an array of ints, asm2wasm emits one pattern, while the wasm backend emits another.
The difference is that in asm2wasm we have one loop variable, the counter, and we calculate the address in the array in each iteration, using an add and a shift, whereas in the wasm backend there are two loop variables, a second for the array offset, and so instead of computing the array address we increment the second variable.
Both seem to run at around the same speed in a simple loop in firefox and chrome. But the two-variable form is larger.
Thoughts?