Skip to content

Fix long-range (non-colocated) aarch64 calls to not use Arm64Call reloc, and fix simplejit to use new long-distance call. #1570

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

cfallin
Copy link
Member

@cfallin cfallin commented Apr 21, 2020

Previously, every call was lowered on AArch64 to a call instruction, which
takes a signed 26-bit PC-relative offset. Including the 2-bit left shift, this
gives a range of +/- 128 MB. Longer-distance offsets would cause an impossible
relocation record to be emitted (or rather, a record that a more sophisticated
linker would fix up by inserting a shim/veneer).

This commit adds a notion of "relocation distance" in the MachInst backends,
and provides this information for every call target and symbol reference. The
intent is that backends on architectures like AArch64, where there are different
offset sizes / addressing strategies to choose from, can either emit a regular
call or a load-64-bit-constant / call-indirect sequence, as necessary. This
avoids the need to implement complex linking behavior.

The MachInst driver code provides this information based on the "colocated" bit
in the CLIF symbol references, which appears to have been designed for this
purpose, or at least a similar one. Combined with the use_colocated_libcalls
setting, this allows client code to ensure that library calls can link to
library code at any location in the address space.

Separately, the simplejit example did not handle Arm64Call; rather than doing
so, it appears all that is necessary to get its tests to pass is to set the
use_colocated_libcalls flag to false, to make use of the above change. This
fixes the libcall_function unit-test in this crate.

@cfallin cfallin added the cranelift:area:aarch64 Issues related to AArch64 backend. label Apr 21, 2020
@cfallin cfallin force-pushed the fix-long-range-aarch64-call branch 2 times, most recently from 4d78b40 to 4d721d0 Compare April 21, 2020 19:40
@github-actions github-actions bot added cranelift Issues related to the Cranelift code generator cranelift:module labels Apr 21, 2020
@github-actions
Copy link

Subscribe to Label Action

cc @bnjbvr

This issue or pull request has been labeled: "cranelift", "cranelift:area:aarch64", "cranelift:module"

Thus the following users have been cc'd because of the following labels:

  • bnjbvr: cranelift

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

@sunfishcode
Copy link
Member

At a quick glance, this does indeed look consistent with the intent of the colocated flag.

@bjorn3
Copy link
Contributor

bjorn3 commented Apr 21, 2020

As far as I know the colocated flag is meant for functions that will be at a fixed offset from the current function, so they could use PCREL relocations, rather than using a GOT. I don't think it necessarily means that it is "near". For example a fully statically linked binary could always use colocated even when it is 4GB big.

@cfallin
Copy link
Member Author

cfallin commented Apr 21, 2020

the colocated flag is meant for functions that will be at a fixed offset

Hmm -- given that, it seems there isn't a CLIF-level notion of "in the same module"? Perhaps we can add a new bit to calls and symbol_values -- I'm not sure what to call it, but we somehow need to communicate the notion of "call into another function in the same Wasm module, which due to Wasm bytecode size limits should be in range of most RISC ISAs' call instructions" (EDIT: for the Wasm use-case, at least; for e.g. AOT-into-object-file use-cases, we just emit the right relocs and everything works).

It seems the effect of colocated is what we want, at least going off of the description: non-colocated implies indirection through a table, while colocated implies direct PC-rel reference.

Alternately, we could consider just doing the right thing and implementing the veneer-insertion linker behavior; that that puts a larger burden on the client, and also breaks invariants around "this blob of machine code from the backend is a fixed-size blob that will never need to be extended with thunks at link time".

@sunfishcode
Copy link
Member

@bjorn3 Ah, that's true. That's what I get for jumping in without full context here :-}.

cfallin added a commit to cfallin/wasmtime that referenced this pull request Apr 22, 2020
This change adds SourceLoc information per instruction in a `VCode<Inst>`
container, and keeps this information up-to-date across register allocation
and branch reordering. The information is initially collected during
instruction lowering, eventually collected on the MachSection, and finally
provided to the environment that wraps the codegen crate for wasmtime.

This PR is based on top of bytecodealliance#1570 and bytecodealliance#1571 (part of a series fixing tests).

This PR depends on wasmtime/regalloc.rs#50, a change to the register
allocator to provide instruction-granularity info on the rewritten
instruction stream (rather than block-granularity).

With the prior PRs applied as well, quite a few more unit tests pass;
the exclusion list in bytecodealliance#1526 should be updated if this PR lands first.
Copy link
Member

@bnjbvr bnjbvr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the definition of colocated, at least to my understanding (I never had to interact with it in the past).

So if I understand correctly, the Cranelift's users may still say that all the function calls call to colocated functions, as long as they insert the veneers, right? (If so, this should work as is in Spidermonkey)

It would be nice to have a way to signal the users that a call actually require a veneer; but this is probably a job for the object/simplejit et al. crates, not for Cranelift itself.

(/me starts to think about reordering functions within sections so as to minimize the need for veneers)

LGTM in any case, thanks!

@cfallin
Copy link
Member Author

cfallin commented Apr 22, 2020

@bnjbvr re:

I agree with the definition of colocated, at least to my understanding

To make sure I understand -- you're agreeing with the initial assertion that colocated seems in practice to mean "reference inside module that can use direct PC-rel references" vs. "reference to another module that needs to go through some sort of indirection / support arbitrary addresses"? Or the later definition clarified by @bjorn3 that it is just "constant PC-rel offset" (but arbitrarily far away)?

I think the basic question is whether we can nudge the definition toward the former -- more of a "same module" vs. "different module" bit, from which we can infer approximate relocation distance (given module size limits), or whether we need another bit / attribute for this.

Thoughts?

@bnjbvr
Copy link
Member

bnjbvr commented Apr 22, 2020

The former, precisely (direct PCRel call vs load from table + indirect call); at least this seems to be the way we set it in Spidermonkey. We could imagine having a different flag for the latter, but I think that's out of scope for this PR.

@cfallin
Copy link
Member Author

cfallin commented Apr 22, 2020

Updated -- just want to make sure we're all OK with the refined meaning of colocated here -- @sunfishcode, OK to do this (with extra doc-comment note on the bool flag definitions) or would you prefer something else?

@cfallin cfallin force-pushed the fix-long-range-aarch64-call branch from cca936d to fe35934 Compare April 23, 2020 20:23
cfallin added a commit to cfallin/wasmtime that referenced this pull request Apr 23, 2020
This change adds SourceLoc information per instruction in a `VCode<Inst>`
container, and keeps this information up-to-date across register allocation
and branch reordering. The information is initially collected during
instruction lowering, eventually collected on the MachSection, and finally
provided to the environment that wraps the codegen crate for wasmtime.

This PR is based on top of bytecodealliance#1570 and bytecodealliance#1571 (part of a series fixing tests).

This PR depends on wasmtime/regalloc.rs#50, a change to the register
allocator to provide instruction-granularity info on the rewritten
instruction stream (rather than block-granularity).

With the prior PRs applied as well, quite a few more unit tests pass;
the exclusion list in bytecodealliance#1526 should be updated if this PR lands first.
cfallin added a commit to cfallin/wasmtime that referenced this pull request Apr 24, 2020
This change adds SourceLoc information per instruction in a `VCode<Inst>`
container, and keeps this information up-to-date across register allocation
and branch reordering. The information is initially collected during
instruction lowering, eventually collected on the MachSection, and finally
provided to the environment that wraps the codegen crate for wasmtime.

This PR is based on top of bytecodealliance#1570 and bytecodealliance#1571 (part of a series fixing tests).

This PR depends on wasmtime/regalloc.rs#50, a change to the register
allocator to provide instruction-granularity info on the rewritten
instruction stream (rather than block-granularity).

With the prior PRs applied as well, quite a few more unit tests pass;
the exclusion list in bytecodealliance#1526 should be updated if this PR lands first.
@cfallin
Copy link
Member Author

cfallin commented Apr 28, 2020

@sunfishcode -- friendly ping, could you verify whether you're OK with this interpretation of colocated?

@cfallin cfallin force-pushed the fix-long-range-aarch64-call branch from fe35934 to a369b7b Compare May 5, 2020 00:29
@cfallin
Copy link
Member Author

cfallin commented May 5, 2020

Rebased and added a more detailed doc comment to the colocated flag, as per a conversation with @sunfishcode just now. Will merge once the tests are green. In the longer term, we'll need to think a bit more about how to support different code models, beyond the simple RelocDistance here; I'll open an issue for that.

@bnjbvr
Copy link
Member

bnjbvr commented May 5, 2020

as per a conversation with @sunfishcode just now.

Where did this conversation happen? I can't find any trace in all the public channels where I'm hanging out. Could the contents of this discussion be summarized somewhere? @sunfishcode @cfallin

@cfallin cfallin force-pushed the fix-long-range-aarch64-call branch from e06a50f to 692f9e4 Compare May 5, 2020 14:23
@cfallin
Copy link
Member Author

cfallin commented May 5, 2020

Sorry, this was from a 1:1 IM conversation on Zulip, after I had pinged about the above; I should've asked for a comment here for the record!

Here's a transcript:

@sunfishcode: I must apologize, I don't have enough context to answer this, nor time at this moment to context switch and page it in.
@sunfishcode: Potentially related is the concept of "code models", https://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html#index-mcmodel_003dsmall
@sunfishcode: although judging by the comments, what Cranelift is doing here is different from any of GCC's defined code models
@sunfishcode: but eventually, I expect Cranelift will need to evolve in the direction of supporting the standard code models, possibly in addition to its own custom ones
@cfallin: OK, no worries. I had been waiting in case you had more to say, but perhaps it's best to merge it with Ben's existing approval, then address the code-model issue later (I can open an issue for it)?
@sunfishcode: Yeah. Ok, what if you added something like this to the comment on the colocated flag:
@sunfishcode: "The exact distance depends on the code model in use. Currently on AArch64 Cranelift uses a custom code model supporting up to +/- byte displacements"
@sunfishcode: or so

…oc, and fix simplejit to use it.

Previously, every call was lowered on AArch64 to a `call` instruction, which
takes a signed 26-bit PC-relative offset. Including the 2-bit left shift, this
gives a range of +/- 128 MB. Longer-distance offsets would cause an impossible
relocation record to be emitted (or rather, a record that a more sophisticated
linker would fix up by inserting a shim/veneer).

This commit adds a notion of "relocation distance" in the MachInst backends,
and provides this information for every call target and symbol reference. The
intent is that backends on architectures like AArch64, where there are different
offset sizes / addressing strategies to choose from, can either emit a regular
call or a load-64-bit-constant / call-indirect sequence, as necessary. This
avoids the need to implement complex linking behavior.

The MachInst driver code provides this information based on the "colocated" bit
in the CLIF symbol references, which appears to have been designed for this
purpose, or at least a similar one. Combined with the `use_colocated_libcalls`
setting, this allows client code to ensure that library calls can link to
library code at any location in the address space.

Separately, the `simplejit` example did not handle `Arm64Call`; rather than doing
so, it appears all that is necessary to get its tests to pass is to set the
`use_colocated_libcalls` flag to false, to make use of the above change. This
fixes the `libcall_function` unit-test in this crate.
@cfallin cfallin force-pushed the fix-long-range-aarch64-call branch from 692f9e4 to e39b4ab Compare May 5, 2020 16:55
@cfallin cfallin merged commit 59039df into bytecodealliance:master May 5, 2020
@cfallin cfallin deleted the fix-long-range-aarch64-call branch May 6, 2020 17:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift:area:aarch64 Issues related to AArch64 backend. cranelift:module cranelift Issues related to the Cranelift code generator
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants