Loop array variable optimization #1054


Closed
kripken opened this issue Jun 16, 2017 · 14 comments

@kripken
Member

kripken commented Jun 16, 2017

While investigating differences between asm2wasm and wasm backend output, I found that one cause of the latter's larger code size is how it emits loops over an array. For example, for a loop writing array[x] = x over an array of ints, asm2wasm emits

    (set_local $1
     (i32.const 0)
    )
    (loop $label$2
     (i32.store
      (i32.add
       (get_local $4)
       (i32.shl
        (get_local $1)
        (i32.const 2)
       )
      )
      (get_local $1)
     )
     (br_if $label$2
      (i32.ne
       (get_local $3)
       (tee_local $1
        (i32.add
         (get_local $1)
         (i32.const 1)
        )
       )
      )
     )
    )

while the wasm backend emits

    (set_local $6
     (i32.const 0)
    )
    (set_local $5
     (get_local $3)
    )
    (loop $label$5
     (i32.store
      (get_local $5)
      (get_local $6)
     )
     (set_local $5
      (i32.add
       (get_local $5)
       (i32.const 4)
      )
     )
     (br_if $label$5
      (i32.ne
       (get_local $0)
       (tee_local $6
        (i32.add
         (get_local $6)
         (i32.const 1)
        )
       )
      )
     )
    )

The difference is that asm2wasm uses a single loop variable, the counter, and computes the address into the array on each iteration using an add and a shift, whereas the wasm backend uses two loop variables, the counter plus a second one holding the current array address, and increments that second variable instead of recomputing the address.
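
To make the two strategies concrete, here is a minimal C-level sketch; the function and variable names are hypothetical, not taken from the actual test case:

    /* asm2wasm style: one induction variable; the store address is
       recomputed each iteration with an add and a shift
       (the i32.add/i32.shl pair in the first listing). */
    void fill_index(int *base, int n) {
      for (int i = 0; i != n; ++i)
        base[i] = i;
    }

    /* wasm backend style: a second induction variable holds the address
       and is bumped by 4 bytes each iteration, so no shift remains
       inside the loop. */
    void fill_pointer(int *base, int n) {
      int *p = base;
      for (int i = 0; i != n; ++i, ++p)
        *p = i;
    }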

Both seem to run at around the same speed in a simple loop in Firefox and Chrome. But

  • the wasm backend's approach is larger, 74 vs. 69 bytes for that loop example, and
  • it uses two locals instead of one, which I suspect may explain part of why fannkuch is slower there: it has many such loops nested inside one another, so it keeps more variables live across large regions, possibly causing spilling (see the sketch after this list).
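
To sketch the register-pressure concern (hypothetical code, not actual fannkuch): with an extra pointer induction variable per loop level, the innermost body keeps four induction values live instead of two.

    /* Hypothetical nested-loop sketch: under LSR each nesting level
       carries a counter plus a pointer, so the number of live induction
       values doubles with loop depth, increasing spill pressure. */
    void touch(int *a, int *b, int n) {
      int *pa = a;                     /* extra induction variable, level 1 */
      for (int i = 0; i != n; ++i, ++pa) {
        int *pb = b;                   /* extra induction variable, level 2 */
        for (int j = 0; j != n; ++j, ++pb)
          *pb += *pa;
      }
    }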

Thoughts?

@sunfishcode
Member

The pass which does this in LLVM is LoopStrengthReduce, which is indeed not run in the fastcomp backend but is run in the wasm backend.

LLVM appears to prefer two induction variables in this case because that eliminates a shl inside the loop. I don't believe current wasm engines are capable of optimizing that shl away, so the two-variable form should, in theory, be slightly faster, except for the potential register pressure.

If you want to experiment with disabling this pass, you can add -disable-lsr to the llc command.

@kripken
Member Author

kripken commented Jun 16, 2017

Thanks, it does look like that's a factor here. Disabling that pass makes fannkuch 7.5% faster and 1.2% smaller.

@jfbastien
Member

> Thanks, it does look like that's a factor here. Disabling that pass makes fannkuch 7.5% faster and 1.2% smaller.

Where is it 7.5% faster?

@kripken
Member Author

kripken commented Jun 17, 2017

I tested SpiderMonkey and v8.

Which I guess raises the question: do any wasm VMs do that type of optimization?

@jfbastien
Member

I guess if it's a lossless size win then it should be done, but if it's not lossless or not a size win then skewing for what a subset of VMs do at a specific point in time seems like the wrong approach. In those cases we should fix the VMs.

@kripken
Member Author

kripken commented Jun 17, 2017

After more thorough testing, it looks like disabling lsr removes most of the speed issues with the wasm backend. Without that pass, it's almost always within ±10% of asm2wasm (the one exception is lzma, where it is 1.3x slower on sm and 2.5x slower on v8).

Doing so also helps code size, but only by a small amount; the output is still e.g. 7.5% larger on box2d (or 3.5% larger if we run binaryen -O on it).

So I'd recommend we disable that pass. If that makes sense, would the place to do it be in LLVM, or should we pass the flag to llc from emscripten?

@kripken
Member Author

kripken commented Oct 24, 2018

Followup PR: emscripten-core/emscripten#7386

kripken added a commit to emscripten-core/emscripten that referenced this issue Oct 31, 2018
I did another round of tests now to follow up on WebAssembly/binaryen#1054 , comparing v8 and sm, on the wasm backend with and without lsr. Disabling lsr reduces code size across the whole benchmark suite, by up to 1% (but usually less). It's also somewhat faster (5-10%) in a few microbenchmarks, on both VMs (so this is not a VM-specific issue). So it seems worthwhile.

LLVM bug: https://bugs.llvm.org/show_bug.cgi?id=39488
Beuc pushed a commit to Beuc/emscripten that referenced this issue Nov 17, 2018
@kripken
Member Author

kripken commented Nov 17, 2021

I investigated this again now. Enabling LSR in the wasm backend increases code size by 1.5% on the emscripten benchmark suite, while the speed results are mixed: some things get faster, some get slower. Since roughly equal amounts change in either direction, and this regresses size, I'm not sure it's worth changing.

Example results: copy is 10% slower, fasta is 11% slower, memops is 33% faster, coremark is 5% faster.

@dschuff
Member

dschuff commented Nov 17, 2021

copy and memops are super micro-benchmarky though, right? coremark is bigger, and fasta is kind of in the middle IIRC. IMO we should weight perf gains in the more "real" ones more heavily.

@dschuff
Member

dschuff commented Nov 17, 2021

Oh, also, IIRC the V8 folks were experimenting with loop unrolling, though I don't remember if they turned it on or not. If they have that kind of infrastructure, maybe it's worth trying out something like this in the VM too.

@MaxGraey
Contributor

> though I don't remember if they turned it on or not

They turned it on, but it has some suboptimal heuristics:
https://bugs.chromium.org/p/v8/issues/detail?id=12186&q=maxgraey&can=2

@dschuff
Member

dschuff commented Nov 17, 2021

Other random ideas: if this does turn out to trade size for speed, we could consider turning it on just at -O3.

@kripken
Member Author

kripken commented Nov 18, 2021

Coremark is more interesting than the microbenchmarks, fair point. On the non-microbenchmarks, Coremark and Linpack benefit while Zlib regresses.

(While microbenchmarks matter less, it's pretty odd that the results there seem to help or hurt at random...)

The VM might be able to do better here; at least it would avoid the download size regression. And since it knows the actual number of registers, maybe it could figure out whether doing this makes sense.

@tlively
Member

tlively commented Jan 11, 2025

It seems that this is more of an Emscripten or LLVM issue, so I'll go ahead and close it.

@tlively tlively closed this as completed Jan 11, 2025