Skip to content

Share trap mode between asm2wasm and s2wasm #1168

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 54 commits into from
Oct 2, 2017
Merged

Conversation

jgravelle-google
Copy link
Contributor

Reimplement handling trapping float ops as a binaryen pass.

Test changes should be just:

  • Reordering generated functions (the timing of when they were added was changed)
  • Un-inlining some generated functions (inlining happens before the trap mode pass can)
  • Adding new functions for more general IR (e.g. f32-to-i32)

Copy link
Member

@dschuff dschuff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't looked at asm2wasm.h yet.

Can we add at least one minimal s2wasm test too?

)


def test_asm2wasm():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any changes in this content, or just moved out from shared?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just moved from shared

*/

//
// Pass that supports potentially-trapping float operations.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just a little more detail about what it actually does?


Expression* FloatTrap::makeTrappingUnary(Unary* curr) {
Name name = getUnaryFuncName(curr);
if (mode == Mode::Allow || !name.is()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would make sense the call to getUnaryFunctionName conditional on the mode being something other than allow? Then we wouldn't run the switch statement unless we knew we actually had to do something.

Come to think of it, why do we even have the allow mode? Doesn't that make the whole pass a no-op?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allow does make the whole pass a no-op. It seems cleaner than conditionally adding the pass if the flag is set.
Could probably do something more clever by not walking the module if Allow, by overriding one of the Walker methods. Would still introduce a sequence point by being another non-parallel pass.

// is precisely like JS but in order to do that we do a slow ffi
// If i64, there is no "JS" way to handle this, as no i64s in JS, so always clamp if we don't allow traps
if (curr->type != i64 && mode == Mode::JS) {
// WebAssembly traps on float-to-int overflows, but asm.js wouldn't, so we must emulate that
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: We aren't really "emulating" it at all, we are literally actually calling JS to do the conversion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost all of the comments were just grabbed as-is from asm2wasm

... which makes references to asm.js all the more misleading.

And "handle" is probably the better word here?

// If i64, there is no "JS" way to handle this, as no i64s in JS, so always clamp if we don't allow traps
if (curr->type != i64 && mode == Mode::JS) {
// WebAssembly traps on float-to-int overflows, but asm.js wouldn't, so we must emulate that
CallImport *ret = allocator.alloc<CallImport>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that I want you to necessarily rewrite this whole patch, but would using a Builder make code significantly better?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed it does a853c0b , thanks

func->body = builder.makeUnary(truncOp,
builder.makeGetLocal(0, type)
);
// too small XXX this is different than asm.js, which does frem. here we
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the 'XXX' meant for you to follow up on before commit?

@@ -81,6 +83,24 @@ int main(int argc, const char *argv[]) {
[&allowMemoryGrowth](Options *, const std::string &) {
allowMemoryGrowth = true;
})
.add("--emit-potential-traps", "",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get that this matches asm2wasm, but I feel like we should have made this flag something like --trap-mode=<enum>. Like, if Allow is the default, this flag is really a no-op. Maybe a followup fix?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well we don't have this hooked up in Emscripten yet, so now seems like a great time to make an asymmetric change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should keep it matching asm2wasm, though? perhaps we should change both

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we can change both. But I agree that since we don't have the s2wasm one hooked up yet, that we might as well put what we finally want in s2wasm, then maybe fix asm2wasm in a followup PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well we'll need to make an Emscripten patch to change the plumbing for asm2wasm anyway, we can bundle that with the patch to add plumbing for s2wasm, and we can keep them the same in Binaryen in the meantime.

So:
Keep this as is for now
Follow up with a patch to Binaryen to change both sets of args to --trap-mode=<blah>
At the same time submit a patch to Emscripten to add s2wasm's invocation and change asm2wasm's.

* See the License for the specific language governing permissions and
* limitations under the License.
*/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a comment explaining what the pass does

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we keep the .h and .cpp, should we duplicate the comment across both, leave it in the header, or move it to the implementation?

@@ -0,0 +1,70 @@
/*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

having a pass with a .h file feels wrong to me. a pass should be self-contained. how about a header file with the just the constants etc., probably under ast/?

for adding the pass to be run, can use the string name, instead of the class identifier.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pass needs an argument though, which means we need to have the Mode enum and constructors visible to callers.

It's weird and I'm not sure where to put it. It isn't registered by name with the pass system, so it doesn't really belong in passes/, but it is a pass over the module. Originally I had it all in a header file in src/, but that seemed wrong.
ast/ seems fine to me, thanks.

Copy link
Member

@kripken kripken Sep 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about making separate passes for the modes: one pass would do the clamp mode, and one would do the JS mode? We can define them all in a single pass cpp file (see e.g. simplify-locals pass which defines 4). Internally it would be a flag inside the pass class.

Then the users would add the right pass of the 2 (or not add any pass if we allow traps). Might add a little utility function for that in the header.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's way cleaner, and has the benefit of letting us run this in wasm-opt as well, which means I can write some finer-grained tests for it.

// Pass that supports potentially-trapping float operations.
//

#ifndef wasm_passes_FloatTrap_h
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why float trap as a name - doesn't this handle int div/rem of 0 also, etc.?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That it does. How about TrapModePass?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm terrible at names, but maybe src/ast/trapping.h, enum is TrapMode?

@@ -47,17 +46,11 @@
(get_local $0)
)
)
;;@ even-opted.cpp:3:0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are test results changing? should this be just a refactoring without functional changes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My PR description was pretty sparse on detail, and I forgot to write down at least one case:

  • Because we change the timing of when we convert things to calls, optimizations happen in different orders. This lets constant folding be more effective, as seen in big_uint_div_u (in test/unit.fromasm), because we can constant fold the i32.div_u instruction directly, and then it doesn't generate a clamping call at all.

In this case, we add clamping after the inliner has run, so i32s-rem gets generated and called, instead of generated, used once, inlined, and removed.

Essentially: semantically these should all be the same, but the timing isn't identical and that perturbs the optimizer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, interesting; it's actually not obvious whether it's better to clamp as early as possible (since it does actually seem better to just inline and remove i32s-rem if there's only one use), or later (since const-folding could eliminate calls before they get generated). Anyway we can land this one way and maybe experiment with that later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yeah, looks like in big_uint_div_u we manage to optimize more. But I'm not sure this is what we want. I believe what's going on is that we have an unsigned -1 / 2 operation, which precompute happily precomputes, leaving the result. But if this operation required clamping or js-ing, then that precomputing would have been wrong, wouldn't it? For example if it traps, it would be precomputed into a trap.

Instead, could we run this pass first, before everything else? Probably will need to do that in two places: when optimizing, right before passRunner.add<AutoDrop>(), and when not optimizing, right before passRunner.add<FinalizeCalls>(this).

Side note, the test result from before this PR could also be optimized with a full constant propagation pass, which we don't have yet. We should probably add that eventually, marked #1172.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I currently run it as early as possible, which is necessarily after AutoDrop because the IR isn't valid before AutoDrop runs - asm2wasm seems to rely on AutoDrop cleaning it up, and we validate after each pass. We could run this separately in a separate PassRunner, or run AutoDrop and only AutoDrop separately in a separate PassRunner.

The other complication w/r/t timing in asm2wasm is the OptimizingIncrementalModuleBuilder that we use when running optimization passes that runs passes on each function - meaning the passes added to it need to be function-parallel, which we fundamentally can't make the clamping pass be because it adds functions to the module and thus introduces a sequence point for pass scheduling. Which we can probably work around with a separate pass runner. Maybe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other downside to adding the functions before any optimizations, is that then in the pass we have to explicitly exclude any generated functions when running on functions, because otherwise we'll replace the i32.div_u instruction in the $i32u-div function with a call to $i32u-div.

The pass also gains a precondition, "must have called helper function addTrappingFunctions before running this pass".

These both make the pass less pass-like, but probably does a better job of maintaining asm2wasm's behavior exactly as it is today. I was operating under the assumption that that's a non-goal as long as the semantics are the same.

Copy link
Member

@kripken kripken Sep 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree those two are downsides. But isn't the issue that we must do it that way to maintain the right semantics? Otherwise if the optimizer kicks in before we modify these math operations, it could optimize them incorrectly, as it's not seeing the correct math operations.

Otherwise we'd need to not run the optimizer until we run that pass, which would be substantially slower (we'd need to finish translating all functions before we start to optimize).

Another option might be to have some utility code that adds the functions, and also has code to do an individual fixup. And separately we'd have a pass that adds the functions and does a traversal to do all the fixups, using that. So the pass could be 100% self-contained. And asm2wasm would not use that pass, it would call the utility code directly to do the fixups during translation (as it does now, but in a call to shared code), and also call the utility method to add the functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds very plausible. Let me try to add a test that exposes that behavior.

I originally implemented this as a pass for s2wasm and asm2wasm just calling utility functions directly, basically changing nothing about its flow (see 8ffe1c9). I was able to use the pass for asm2wasm as well, which let me move the utility functions into the pass itself, which let me store the added functions on an instance variable on the pass object, instead of being a static variable, and generally make the interface much nicer.
Clean code does not trump working code, so I'll fix it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an INT_MIN / -1 test in 691cf5e. It does not currently misoptimize into a trap, and thinking about it a bit more I'm not sure if any optimization pass should ever do so?
Do these optimizations happen other than in OptimizeInstructions.cpp? Are there any patterns you can think of that would get optimized into an unconditional trap that I should check?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm you're right, currently Precompute will handle constants and branches, but not traps. But it could handle traps some day, although I guess that's low priority (and the type of trap would be lost, leaving a generic (unreachable)).

An issue currently affecting us would be that the side effects might be different (e.g. an FFI into JS should have more side effects than possibly trapping).

Anyhow, we don't need a specific new test for that I guess, seeing the current test outputs are identical should be enough.

@jgravelle-google
Copy link
Contributor Author

I think this pass should be ready to merge at this point? Is there anything I'm missing?

func->body = builder.makeUnary(truncOp,
builder.makeGetLocal(0, type)
);
// too small XXX this is different than asm.js, which does frem. here we
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can probably just remove the "XXX" and maybe say "float is too small"

kripken added a commit that referenced this pull request Sep 12, 2017
Implements #1172: this adds a variant of precompute, "precompute-propagate", which also does constant propagation. Precompute by itself just runs the interpreter on each expression and sees if it is in fact a constant; precompute-propagate also looks at the graph of connections between get and set locals, and propagates those constant values.

This helps with cases as noticed in #1168 - while in most cases LLVM will do this already, it's important when inlining, e.g. inlining of the clamping math functions. This new pass is run when inlining, and otherwise only in -O3/-Oz, as it does increase compilation time noticeably if run on everything (and for almost no benefit if LLVM has run).

Most of the code here is just refactoring out from the ssa pass the get/set graph computation, so it can now be used by both the ssa pass and precompute-propagate.
src/asm2wasm.h Outdated
@@ -1440,7 +1162,7 @@ void Asm2WasmBuilder::processAsm(Ref ast) {
if (importResult == f64) {
// we use a JS f64 value which is the most general, and convert to it
switch (old) {
case i32: replaceCurrent(parent->makeTrappingFloatToInt32(true /* signed, asm.js ffi */, curr)); break;
case i32: replaceCurrent(parent->builder.makeUnary(TruncSFloat64ToInt32, curr)); break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't we say we'd avoid using the pass in asm2wasm, and have it call directly to utility methods to do this? to avoid all the issues with having a pass be self-contained, and run at the right time, and all that?

maybe I misuderstood what we agreed on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I did. I thought we only cared about doing that in order to avoid potentially optimizing trapping opts into unconditional wasm traps.
As far as correctness goes, if that's not a concern then this is still fine.
This does change the visibility of the clamping to the optimizer, which I was assuming wasn't as important because of the inherent slowness of clamp/js modes as it was.

If we do care, I still think we should use a pass, with the following approach:

  1. change the pass to be function-parallel
  2. run it as early as possible (right after AutoDrop)
  3. generate all the generated functions all the time, because we don't know which ones we need
  4. let DCE remove the ones we don't need

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should care, yeah: if the optimizer works on incorrect code (before clamping was applied) then it doesn't see the right side effects, might precompute incorrectly, etc.

The plan 1-4 sounds ok. Hopefully with that the test outputs will be unchanged. (The only change I'd expect to see is that the order of functions might differ, since we add the helper functions themselves at a different time, that's ok of course.)

src/asm2wasm.h Outdated
@@ -1550,6 +1272,7 @@ void Asm2WasmBuilder::processAsm(Ref ast) {
// finalizeCalls also does autoDrop, which is crucial for the non-optimizing case,
// so that the output of the first pass is valid
passRunner.add<FinalizeCalls>(this);
addTrapModePass(passRunner, trapMode);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will run that pass after the functions have already been (mostly) optimized.

(get_local $2)
)
(i32.const 0)
(call $i32s-rem
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, this used to be inlined, and now it isn't. if this is still the case after the other changes, that's worth investigating.

@jgravelle-google
Copy link
Contributor Author

Ok there we go.

So: what we discussed earlier turned out to be way trickier than I was expecting due to timing issues that I still don't understand. So I bailed on that and basically did what @kripken suggested in the first place, and just directly call the makeTrapping functions inline in asm2wasm, the way they were done before I touched it at all.
I did this by introducing a new helper class, GeneratedTrappingFunctions, which either collects the generated functions to add later, or it adds them immediately to the module. This lets us avoid any global/static variables, while letting us keep the desired functionality on either end, and reuse the code to boot.
The test diffs now are totally additive (s2wasm test case, plus the added trapping_sint_div_s case in unit.asm.js), or new functionality (handling f32-to-int in asm2wasm without having to convert to
f64s first, and [not yet implemented] double->uint32 conversion), which should be much more reasonable.
Only thing I'm unhappy with is the name: GeneratedTrappingFunctions. Any suggestions?

@jgravelle-google
Copy link
Contributor Author

In other fun news, rebasing this on master I run into a problem:

Running bin/asm2wasm test/unit.asm.js -O --mem-init=a.mem gives different results for the function trapping_sint_div_s when the environment variable BINARYEN_PASS_DEBUG is 1. With it set, the function gets const-folded into just (i32.const 0), otherwise it calls the generated function as usual.

We only added it to test that trapping divisions would get
constant-folded at the correct time. Now that we're not changing the
timing of trapping modes, the test is unneeded (and problematic).
@kripken
Copy link
Member

kripken commented Sep 29, 2017

That's... very strange. Pass debug mode should never change behavior. It just forces single-threaded mode + extra validation. Perhaps this is because of single-threaded mode, does BINARYEN_CORES=1 cause the same odd result? If so maybe there's a threading bug.

@jgravelle-google
Copy link
Contributor Author

Yep, happens with BINARYEN_CORES=1 too.
I figure the problem is that we're adding functions while we're iterating over the module in asm2wasm, therefore the call target isn't visible when working in parallel, therefore it can't be const-folded. But when debugging or single-threading, that's less of a problem.

In any case I only added the trapping_sint_div_s test because I was changing the order of optimizations/inlining with respect to trapping functions. So I just removed the offending test because that's not necessarily something we care about now that the timing is the same and other tests haven't changed.

@kripken
Copy link
Member

kripken commented Sep 29, 2017

Ok, sounds like it won't block this PR then, but can you make me a standalone testcase of the issue perhaps? I don't fully understand but it sounds like this could be a serious existing bug, so I'd like to investigate.

@jgravelle-google
Copy link
Contributor Author

Sure, the test I added in 33cbbc8 causes the problem. Just tested it on master and it reproduces the issue.

I think it's because asm2wasm is creating new functions while running optimizations on functions in the background.

@kripken
Copy link
Member

kripken commented Oct 2, 2017

Thanks, and yeah, you're right, it's related to the optimization and adding order. Turns out we add those trapping functions while translating the functions, and they don't get in the queue t o be optimized. Except if we optimize at the end all together anyhow - if no multithreading - in which case they do.

Mostly we don't need to optimize these methods anyhow, but your testcase actually shows an example of where we do (inlining limit is hit or not hit based on whether we optimized). So seems worth fixing, fix in #1208.

@jgravelle-google
Copy link
Contributor Author

Hm, can we land this before #1208? That fix is going to look very different as a result of this PR.

Anything else outstanding here or should I merge?

@kripken
Copy link
Member

kripken commented Oct 2, 2017

Yeah, merge this first. We'll figure out those followups later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants