Commit b3e52ec

Merge pull request #51 from sysprog21/cancel-wrapper

Accelerate open+close on aarch64 rewrite mode

2 parents 0000883 + 000015c

8 files changed (+961, -189 lines)
docs/cancel-wrapper.md

Lines changed: 239 additions & 0 deletions
# cancel-wrapper fast path (aarch64)

## What this is

A fast path that intercepts calls to musl's `__syscall_cancel` from
inside libc syscall wrappers and replaces them with a direct dispatch
into kbox's rewrite trampoline. The win is concentrated on aarch64
`open+close`, which is the only remaining non-near-native row in the
rewrite baseline (`47.3us` vs `~1.5us` for the other rewritten
syscalls; see `TODO.md`). The reason it lags is that musl's `open()`
wrapper does not call the kernel directly: it goes through
`__syscall_cancel`, and the rewriter's per-SVC-site patcher only sees
the SVC inside `__syscall_cancel` itself, where the wrapper-class
contract does not hold.

The patch redirects the *call site* (the `bl __syscall_cancel`
instruction inside the wrapper function) instead of the SVC, so the
fast path executes with the wrapper's calling convention -- not the
generic SVC trampoline's.

## What `__syscall_cancel` actually does (musl)

Static-musl's `__syscall_cancel(a, b, c, d, e, f, nr)` (see
`musl/src/thread/__syscall_cancel.c`):

1. Loads `self->cancel` (the calling thread's pthread_cancel state).
2. If a cancellation has been requested AND the thread is in
   `PTHREAD_CANCEL_ENABLE` AND the cancel type is asynchronous, it
   raises the cancel and never returns from the call.
3. Otherwise it issues the syscall (the SVC that the SVC-site
   rewriter currently patches) and returns the raw kernel result
   (`-4095..-1` encodes errno; everything else is the return value).
4. After the SVC, it re-checks the cancel state for deferred
   cancellation and may run the cancel handlers if a cancel arrived
   during the syscall.

The wrapper's calling convention on aarch64:

| Reg       | Meaning                          |
|-----------|----------------------------------|
| `x0..x5`  | syscall args `a..f`              |
| `x6`      | syscall number `nr`              |
| `x0`      | return value (raw kernel result) |
| `x7..x18` | caller-clobbered scratch         |
| `x19+`    | callee-saved (preserved)         |

The key facts for kbox:

- errno is *not* set inside `__syscall_cancel`; the libc wrapper
  (e.g. `open()`) inspects the return and sets errno.
- The result register (`x0`) and the dispatch register (`x6`) are
  exactly what `kbox_syscall_rewrite_aarch64_dispatch` already
  consumes via the existing cancel trampoline
  (`kbox_syscall_rewrite_aarch64_cancel_entry` in `src/rewrite.c`).
- All FD bookkeeping happens inside the dispatch path; the cancel
  fast path is just an alternate entry to the same dispatch and
  inherits the existing forwarder semantics unchanged.

## What we lose

Cancellation-point semantics. By bypassing `__syscall_cancel`, the
fast path skips the pre- and post-syscall pthread cancel checks. For
a single-threaded program no cancellation can ever be pending, so
this is a no-op. For a multi-threaded program with a thread that has
been `pthread_cancel`'d, the bypassed `open()` will not act as a
cancellation point and will not raise the cancel until the next real
cancellation point.

This is the only correctness gap. errno propagation, FD bookkeeping,
register preservation, and unwind metadata are all unaffected (we
return to `bl_pc + 4`, exactly where the BL would have returned, with
`x0` set the same way).

## Gating policy

Two conditions must both hold at install time for a cancel-style BL
site to be promoted. The gate is stored in
`kbox_rewrite_runtime::cancel_promote_allowed`, computed once during
`kbox_rewrite_runtime_install()` and consulted from
`rewrite_runtime_should_patch_site()`.

### Condition 1: the binary is static

`launch->interp_elf == NULL && launch->interp_elf_len == 0`. For a
dynamic binary, libc (and therefore the clone wrapper that
`pthread_create` depends on) lives in an interpreter-loaded DSO that
the main-ELF scan cannot see. A dynamic program could also `dlopen`
a DSO that spins up threads at runtime, which is not detectable
statically at all. Promotion for dynamic binaries is unsafe, so the
gate rejects them outright. This also means the cancel-wrapper fast
path only benefits static-musl binaries today; dynamic programs stay
on the existing forwarder path.

### Condition 2: main_elf has no fork-family wrapper sites

`kbox_rewrite_has_wrapper_syscalls(main_elf, ..., {clone, fork,
vfork, clone3})` returns 0. Because the binary is static (condition
1), libc is *part of* `main_elf` -- there is no separate interpreter
ELF to scan. Any `pthread_create` → libc clone wrapper compiles
down to a `mov x8, #220; svc 0` site inside `main_elf`'s text
segment, and the wrapper-number scanner catches it. Scanning
`main_elf` alone is therefore sufficient to cover the embedded libc
in a static build; this is the important invariant that makes the
gate sound.

Rationale: no fork-family sites in the main (= only) ELF implies
the program cannot create additional threads, which implies pthread
cancellation cannot be pending on any thread, which makes the
cancel bypass a strict no-op.

### Conservative by design

This gate rejects multi-threaded programs that never actually call
`pthread_cancel` -- those would also be safe to promote, but proving
it statically is unreliable. The gate costs nothing on the
`bench-test` target (single-threaded) and trivially preserves
correctness for everything else by leaving the existing forwarder
path in place.

### Known residual limitations

- A static program that invokes `clone`/`clone3` via `syscall(3)`
  (register-indirect `mov x8, x_reg; svc 0`) slips past the
  wrapper-number scanner, which only matches the literal-immediate
  pattern `movz x8, #nr; svc 0`. The gate would approve such a
  binary even though it can actually create threads. This pattern
  is very rare in practice -- no test binary in the tree exercises
  it, and musl's own `pthread_create` path uses the immediate form
  -- but it is a known unsoundness that would need a stronger
  static analysis to close. Not introduced by this series; the
  underlying scanner predates it.

- The shared-libc musl `__syscall_cancel` calling convention
  differs from the static one (`nr` in `x0` vs `x6`). Even if
  condition 1 were relaxed, the current BL-site detector would
  not recognize dynamic-musl call sites. Out of scope for this
  fast path.

## Site detection

Pattern (aarch64, walk the segment one 4-byte instruction at a time):

```
mov{z} x6, #imm16     ; syscall number into x6
... within 32 bytes ...
bl <target>           ; opcode 0x94XXXXXX
```

The intermediate instructions are arbitrary (typically arg setup and
maybe a `mov x?, x6`). The 32-byte horizon is a safety bound to keep
the heuristic local; in practice the BL is one or two instructions
after the `mov`.

We deliberately do *not* match plain `b` (opcode `0x14XXXXXX`).
A plain `b` would be a tail call, which has different return
semantics: after the rewrite trampoline executes, control falls
through to `b_pc + 4`, but a tail-call site has no meaningful "next
instruction" -- the surrounding function expects never to come back.
Restricting to BL sidesteps this entire class of misanalysis.

We also do not validate that the BL target is the actual
`__syscall_cancel` symbol. Doing so would require a dynsym/symtab
walk and would only work for symbol-bearing static-musl binaries.
The `mov x6, #nr` constraint is already a strong structural filter
(x6 is not used as an argument register by the kernel syscall ABI on
aarch64, which passes args in `x0..x5`), and the single-thread gate
makes any false positive a no-op anyway.

## Patch and trampoline

The patch is the same B-relative-to-trampoline encoding the rewriter
already uses for SVC sites:

```
[bl <target>] -> [b <trampoline_slot>]
```

The trampoline slot (`AARCH64_REWRITE_SLOT_SIZE` = 32 bytes) is
emitted by `write_aarch64_trampoline` with
`wrapper_kind = SYSCALL_CANCEL`, which causes it to point at
`kbox_syscall_rewrite_aarch64_cancel_entry` (versus the regular
`kbox_syscall_rewrite_aarch64_entry` for SVC sites). The cancel
entry differs from the regular entry in exactly one line: it loads
`nr` from saved `x6` (offset `+40`) instead of saved `x8` (offset
`+56`). Everything else -- register save/restore, the call into
`kbox_syscall_rewrite_aarch64_dispatch`, and the resume sequence --
is identical.

After dispatch, the cancel entry executes:

```
add x16, x19, #4   ; x19 holds origin = bl_pc, so x16 = bl_pc + 4
br  x16
```

This resumes the wrapper at the instruction after the BL, with `x0`
holding the kernel result, which is exactly the state the wrapper
expects after a normal return from `__syscall_cancel`.

`x30` (LR) is restored to whatever the BL site's caller had stored
before the call. We do *not* update it to `bl_pc + 4` even though a
real BL would have. This is fine: AAPCS64 treats `x30` as
caller-clobbered across a call, and the wrapper's prologue has
already saved the function's own return address; nothing in the
wrapper body reads `x30` between the BL and the function epilogue.

If the BL is more than ±128 MiB from the trampoline page, the
existing veneer fallback in `kbox_rewrite_runtime_install` bridges
the gap exactly the same way it does for out-of-range SVC sites.

## Tests

Unit (`tests/unit/test-rewrite.c`):

- Encodes a synthetic `mov x6, #57; bl ...` (close) and a
  `mov x6, #56; bl ...` (openat) inside an aarch64 ELF segment;
  asserts `analyze_segment` emits a planned site at the BL with
  `width=4` and `original` matching the BL bytes.
- Asserts `kbox_rewrite_encode_patch` for the BL site emits a
  4-byte B with `imm26` pointing at the trampoline.
- Asserts that with `kbox_rewrite_has_fork_sites` true on the same
  ELF, `cancel_promote_allowed` would be 0 and the install path
  would skip the cancel-kind sites (validated through
  `rewrite_runtime_should_patch_site`).

Integration: bench-test under `--syscall-mode=rewrite` on lima
(correctness) and on `arm` (perf delta on `open+close`).

## Performance baseline

Before (real Arm64, `bench-test 1000`, BUILD=release, from TODO.md):

| Syscall    | rewrite |
|------------|---------|
| open+close | 47.3us  |

Target: pull `open+close` into the same ~1.5 us tier as the other
rewritten rows. The numbers will be re-captured on the same `arm`
host with the same release build before/after this change and
recorded in the changelog.
