Loop array variable optimization #1054


Closed
kripken opened this issue Jun 16, 2017 · 14 comments

@kripken
Member

kripken commented Jun 16, 2017

While investigating differences between asm2wasm and wasm backend output, I found that one cause of the latter's larger code size is how it emits loops over an array. For example, for a loop writing array[x] = x over an array of ints, asm2wasm emits

    (set_local $1
     (i32.const 0)
    )
    (loop $label$2
     (i32.store
      (i32.add
       (get_local $4)
       (i32.shl
        (get_local $1)
        (i32.const 2)
       )
      )
      (get_local $1)
     )
     (br_if $label$2
      (i32.ne
       (get_local $3)
       (tee_local $1
        (i32.add
         (get_local $1)
         (i32.const 1)
        )
       )
      )
     )
    )

while the wasm backend emits

    (set_local $6
     (i32.const 0)
    )
    (set_local $5
     (get_local $3)
    )
    (loop $label$5
     (i32.store
      (get_local $5)
      (get_local $6)
     )
     (set_local $5
      (i32.add
       (get_local $5)
       (i32.const 4)
      )
     )
     (br_if $label$5
      (i32.ne
       (get_local $0)
       (tee_local $6
        (i32.add
         (get_local $6)
         (i32.const 1)
        )
       )
      )
     )
    )

The difference is that asm2wasm uses a single loop variable, the counter, and computes the address into the array on each iteration using an add and a shift, whereas the wasm backend uses two loop variables, the counter plus a second one holding the current array address, and increments that second variable instead of recomputing the address.
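
To make the two strategies concrete, here is a minimal C-level sketch; the function and variable names are hypothetical, not taken from the actual test case:

    /* asm2wasm style: one induction variable; the store address is
       recomputed each iteration with an add and a shift
       (the i32.add/i32.shl pair in the first listing). */
    void fill_index(int *base, int n) {
      for (int i = 0; i != n; ++i)
        base[i] = i;
    }

    /* wasm backend style: a second induction variable holds the address
       and is bumped by 4 bytes each iteration, so no shift remains
       inside the loop. */
    void fill_pointer(int *base, int n) {
      int *p = base;
      for (int i = 0; i != n; ++i, ++p)
        *p = i;
    }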

Both seem to run at around the same speed in a simple loop in Firefox and Chrome. But

  • the wasm backend's approach is larger, 74 vs. 69 bytes for that loop example, and
  • it uses two locals instead of one, which I suspect may explain part of why fannkuch is slower there: it has many such loops nested inside one another, so it keeps more variables live across large regions, possibly causing spilling (see the sketch after this list).
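
To sketch the register-pressure concern (hypothetical code, not actual fannkuch): with an extra pointer induction variable per loop level, the innermost body keeps four induction values live instead of two.

    /* Hypothetical nested-loop sketch: under LSR each nesting level
       carries a counter plus a pointer, so the number of live induction
       values doubles with loop depth, increasing spill pressure. */
    void touch(int *a, int *b, int n) {
      int *pa = a;                     /* extra induction variable, level 1 */
      for (int i = 0; i != n; ++i, ++pa) {
        int *pb = b;                   /* extra induction variable, level 2 */
        for (int j = 0; j != n; ++j, ++pb)
          *pb += *pa;
      }
    }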

Thoughts?

@sunfishcode
Member

The pass which does this in LLVM is LoopStrengthReduce, which is indeed not run in the fastcomp backend but is run in the wasm backend.

LLVM appears to prefer two induction variables in this case because that eliminates a shl inside the loop. I don't believe current wasm engines are capable of optimizing that shl away, so the two-variable form should, in theory, be slightly faster, except for the potential register pressure.

If you want to experiment with disabling this pass, you can add -disable-lsr to the llc command.

@kripken
Member Author

kripken commented Jun 16, 2017

Thanks, it does look like that's a factor here. Disabling that pass makes fannkuch 7.5% faster and 1.2% smaller.

@jfbastien
Member

> Thanks, it does look like that's a factor here. Disabling that pass makes fannkuch 7.5% faster and 1.2% smaller.

Where is it 7.5% faster?

@kripken
Member Author

kripken commented Jun 17, 2017

I tested SpiderMonkey and v8.

Which I guess raises the question: do any wasm VMs do that type of optimization?

@jfbastien
Member

I guess if it's a lossless size win then it should be done, but if it's not lossless or not a size win then skewing for what a subset of VMs do at a specific point in time seems like the wrong approach. In those cases we should fix the VMs.

@kripken
Member Author

kripken commented Jun 17, 2017

After more thorough testing, it looks like disabling lsr removes most of the speed issues with the wasm backend. Without that pass, it's almost always within ±10% of asm2wasm (the one exception is lzma, where it is 1.3x slower on sm and 2.5x slower on v8).

Doing so also helps code size, but only by a small amount; the output is still e.g. 7.5% larger on box2d (or 3.5% larger if we run binaryen -O on it).

So I'd recommend we disable that pass. If that makes sense, would the place to do it be in LLVM, or should we pass the flag to llc from emscripten?

@kripken
Member Author

kripken commented Oct 24, 2018

Followup PR: emscripten-core/emscripten#7386

kripken added a commit to emscripten-core/emscripten that referenced this issue Oct 31, 2018
I did another round of tests now to follow up on WebAssembly/binaryen#1054 , comparing v8 and sm, on the wasm backend with and without lsr. Disabling lsr reduces code size across the whole benchmark suite, by up to 1% (but usually less). It's also somewhat faster (5-10%) in a few microbenchmarks, on both VMs (so this is not a VM-specific issue). So it seems worthwhile.

LLVM bug: https://bugs.llvm.org/show_bug.cgi?id=39488
Beuc pushed a commit to Beuc/emscripten that referenced this issue Nov 17, 2018
@kripken
Member Author

kripken commented Nov 17, 2021

I investigated this again now. Enabling LSR in the wasm backend increases code size by 1.5% on the emscripten benchmark suite, while the speed results are mixed: some things get faster, some get slower. Since roughly equal amounts change in either direction, and this regresses size, I'm not sure it's worth changing.

Example results: copy is 10% slower, fasta is 11% slower, memops is 33% faster, coremark is 5% faster.

@dschuff
Member

dschuff commented Nov 17, 2021

copy and memops are super micro-benchmarky though, right? coremark is bigger, and fasta is kind of in the middle IIRC. IMO we should weight perf gains in the more "real" ones more heavily.

@dschuff
Member

dschuff commented Nov 17, 2021

Oh, also, IIRC the V8 folks were experimenting with loop unrolling, though I don't remember if they turned it on or not. If they have that kind of infrastructure, maybe it's worth trying out something like this in the VM too.

@MaxGraey
Contributor

> though I don't remember if they turned it on or not

They turned it on, but it has some suboptimal heuristics:
https://bugs.chromium.org/p/v8/issues/detail?id=12186&q=maxgraey&can=2

@dschuff
Member

dschuff commented Nov 17, 2021

Other random ideas: if this does turn out to trade size for speed, we could consider turning it on just at -O3.

@kripken
Member Author

kripken commented Nov 18, 2021

Coremark is more interesting than the microbenchmarks, fair point. On the non-microbenchmarks, Coremark and Linpack benefit while Zlib regresses.

(While microbenchmarks matter less, it's pretty odd that the results there seem to help or hurt at random...)

The VM might be able to do better here; at least it would avoid the download size regression. And since it knows the actual number of registers, maybe it could figure out whether doing this makes sense.

@tlively
Member

tlively commented Jan 11, 2025

It seems that this is more of an Emscripten or LLVM issue, so I'll go ahead and close it.

@tlively tlively closed this as completed Jan 11, 2025