Fix long-range (non-colocated) aarch64 calls to not use Arm64Call reloc, and fix simplejit to use new long-distance call. #1570
Subscribe to Label Action: cc @bnjbvr
This issue or pull request has been labeled: "cranelift", "cranelift:area:aarch64", "cranelift:module"
Thus the following users have been cc'd because of the following labels:
To subscribe or unsubscribe from this label, edit the |
At a quick glance, this does indeed look consistent with the intent of the |
As far as I know the |
Hmm -- given that, it seems there isn't a CLIF-level notion of "in the same module"? Perhaps we can add a new bit to It seems the effect of Alternately, we could consider just doing the right thing and implementing the veneer-insertion linker behavior; but that puts a larger burden on the client, and also breaks invariants around "this blob of machine code from the backend is a fixed-size blob that will never need to be extended with thunks at link time". |
@bjorn3 Ah, that's true. That's what I get for jumping in without full context here :-}. |
This change adds SourceLoc information per instruction in a `VCode<Inst>` container, and keeps this information up-to-date across register allocation and branch reordering. The information is initially collected during instruction lowering, eventually collected on the MachSection, and finally provided to the environment that wraps the codegen crate for wasmtime. This PR is based on top of bytecodealliance#1570 and bytecodealliance#1571 (part of a series fixing tests). This PR depends on wasmtime/regalloc.rs#50, a change to the register allocator to provide instruction-granularity info on the rewritten instruction stream (rather than block-granularity). With the prior PRs applied as well, quite a few more unit tests pass; the exclusion list in bytecodealliance#1526 should be updated if this PR lands first.
I agree with the definition of `colocated`, at least to my understanding (I never had to interact with it in the past).
So if I understand correctly, Cranelift's users may still say that all function calls are calls to colocated functions, as long as they insert the veneers, right? (If so, this should work as is in Spidermonkey.)
It would be nice to have a way to signal to users that a call actually requires a veneer; but this is probably a job for the object/simplejit et al. crates, not for Cranelift itself.
(/me starts to think about reordering functions within sections so as to minimize the need for veneers)
LGTM in any case, thanks!
@bnjbvr re:
To make sure I understand -- you're agreeing with the initial assertion that I think the basic question is whether we can nudge the definition toward the former -- more of a "same module" vs. "different module" bit, from which we can infer approximate relocation distance (given module size limits), or whether we need another bit / attribute for this. Thoughts? |
The former, precisely (direct PCRel call vs load from table + indirect call); at least this seems to be the way we set it in Spidermonkey. We could imagine having a different flag for the latter, but I think that's out of scope for this PR. |
Updated -- just want to make sure we're all OK with the refined meaning of |
@sunfishcode -- friendly ping, could you verify whether you're OK with this interpretation of |
Rebased and added a more detailed doc comment to the |
Where did this conversation happen? I can't find any trace in all the public channels where I'm hanging out. Could the contents of this discussion be summarized somewhere? @sunfishcode @cfallin |
Sorry, this was from a 1:1 IM conversation on Zulip, after I had pinged about the above; I should've asked for a comment here for the record! Here's a transcript:
|
…oc, and fix simplejit to use it. Previously, every call was lowered on AArch64 to a `call` instruction, which takes a signed 26-bit PC-relative offset. Including the 2-bit left shift, this gives a range of +/- 128 MB. Longer-distance offsets would cause an impossible relocation record to be emitted (or rather, a record that a more sophisticated linker would fix up by inserting a shim/veneer). This commit adds a notion of "relocation distance" in the MachInst backends, and provides this information for every call target and symbol reference. The intent is that backends on architectures like AArch64, where there are different offset sizes / addressing strategies to choose from, can either emit a regular call or a load-64-bit-constant / call-indirect sequence, as necessary. This avoids the need to implement complex linking behavior. The MachInst driver code provides this information based on the "colocated" bit in the CLIF symbol references, which appears to have been designed for this purpose, or at least a similar one. Combined with the `use_colocated_libcalls` setting, this allows client code to ensure that library calls can link to library code at any location in the address space. Separately, the `simplejit` example did not handle `Arm64Call`; rather than doing so, it appears all that is necessary to get its tests to pass is to set the `use_colocated_libcalls` flag to false, to make use of the above change. This fixes the `libcall_function` unit-test in this crate.
Previously, every call was lowered on AArch64 to a `call` instruction, which
takes a signed 26-bit PC-relative offset. Including the 2-bit left shift, this
gives a range of +/- 128 MB. Longer-distance offsets would cause an impossible
relocation record to be emitted (or rather, a record that a more sophisticated
linker would fix up by inserting a shim/veneer).
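The +/- 128 MB figure follows from the encoding: a signed 26-bit immediate counts 4-byte instructions, so it is scaled by a 2-bit left shift. As a minimal sketch of that arithmetic (the constant and function names below are illustrative assumptions, not Cranelift code), a range check for a direct call could look like this:

```rust
/// Width of the signed immediate in the direct-call encoding (from the text above).
const CALL_IMM_BITS: u32 = 26;
/// Instructions are 4 bytes wide, so the immediate is shifted left by 2 bits.
const CALL_IMM_SHIFT: u32 = 2;

/// Hypothetical helper: returns true if `offset` (in bytes, relative to the
/// call instruction) can be encoded directly in the 26-bit immediate.
fn fits_in_direct_call(offset: i64) -> bool {
    // Smallest encodable offset: -2^25 instructions = -2^27 bytes (-128 MiB).
    let min = -(1i64 << (CALL_IMM_BITS - 1 + CALL_IMM_SHIFT));
    // Largest encodable offset: (2^25 - 1) instructions = 128 MiB - 4 bytes.
    let max = ((1i64 << (CALL_IMM_BITS - 1)) - 1) << CALL_IMM_SHIFT;
    // Targets must be 4-byte aligned because the low two bits are implicit.
    offset % 4 == 0 && offset >= min && offset <= max
}
```

Any target for which this check fails is exactly the case where the old lowering would have produced a relocation that a simple linker or JIT cannot resolve without a veneer.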
This commit adds a notion of "relocation distance" in the MachInst backends,
and provides this information for every call target and symbol reference. The
intent is that backends on architectures like AArch64, where there are different
offset sizes / addressing strategies to choose from, can either emit a regular
call or a load-64-bit-constant / call-indirect sequence, as necessary. This
avoids the need to implement complex linking behavior.
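To make the idea concrete, here is a hedged sketch of how a backend might pick a call sequence from a relocation distance. The enum and function names are assumptions chosen for illustration; the actual MachInst types in this PR may be shaped differently.

```rust
/// Hypothetical stand-in for the "relocation distance" of a call target.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum RelocDistance {
    /// Target is known to be within range of a direct PC-relative call.
    Near,
    /// Target may be anywhere in the address space.
    Far,
}

/// The two lowering strategies described in the text above.
#[derive(Debug, PartialEq, Eq)]
enum CallSequence {
    /// A single direct call with a PC-relative relocation.
    Direct,
    /// Load the 64-bit absolute address into a register, then call indirectly.
    LoadAddrThenCallIndirect,
}

/// Pick the call sequence for a target at the given distance.
fn select_call_sequence(dist: RelocDistance) -> CallSequence {
    match dist {
        RelocDistance::Near => CallSequence::Direct,
        RelocDistance::Far => CallSequence::LoadAddrThenCallIndirect,
    }
}
```

The point of deciding this at lowering time is that the emitted code has a fixed size and never needs to be patched with thunks at link time; the cost is a longer sequence for every far call, even when the target happens to land within range.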
The MachInst driver code provides this information based on the "colocated" bit
in the CLIF symbol references, which appears to have been designed for this
purpose, or at least a similar one. Combined with the `use_colocated_libcalls`
setting, this allows client code to ensure that library calls can link to
library code at any location in the address space.
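The mapping from the "colocated" bit to a distance can be sketched as follows. This is an illustrative model only: `SymbolRef`, `SymbolDistance`, `distance_of`, and `libcall_ref` are hypothetical names, and the only identifier taken from the text is the `use_colocated_libcalls` setting.

```rust
/// Hypothetical distance classification derived from a symbol reference.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SymbolDistance {
    Near,
    Far,
}

/// Hypothetical view of a CLIF symbol reference, keeping only the bit that
/// matters for this decision.
struct SymbolRef {
    colocated: bool,
}

/// Colocated symbols are assumed to be in range of a direct call; everything
/// else is treated as arbitrarily far away.
fn distance_of(sym: &SymbolRef) -> SymbolDistance {
    if sym.colocated {
        SymbolDistance::Near
    } else {
        SymbolDistance::Far
    }
}

/// Model of how the `use_colocated_libcalls` setting would mark references
/// to library routines.
fn libcall_ref(use_colocated_libcalls: bool) -> SymbolRef {
    SymbolRef {
        colocated: use_colocated_libcalls,
    }
}
```

Under this model, a client that sets `use_colocated_libcalls` to false gets `SymbolDistance::Far` for every libcall, so the backend would use the long-range sequence and the library code can live anywhere in the address space.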
Separately, the `simplejit` example did not handle `Arm64Call`; rather than doing
so, it appears all that is necessary to get its tests to pass is to set the
`use_colocated_libcalls` flag to false, to make use of the above change. This
fixes the `libcall_function` unit-test in this crate.