# cancel-wrapper fast path (aarch64)

## What this is

A fast path that intercepts calls to musl's `__syscall_cancel` from
inside libc syscall wrappers and replaces them with a direct dispatch
into kbox's rewrite trampoline. The win is concentrated on aarch64
`open+close`, which is the only remaining non-near-native row in the
rewrite baseline (`47.3us` vs `~1.5us` for the other rewritten
syscalls; see `TODO.md`). The reason it lags is that musl's `open()`
wrapper does not call the kernel directly: it goes through
`__syscall_cancel`, and the rewriter's per-SVC-site patcher only sees
the SVC inside `__syscall_cancel` itself, where the wrapper-class
contract does not hold.

The patch redirects the *call site* (the `bl __syscall_cancel`
instruction inside the wrapper function) instead of the SVC, so the
fast path executes with the wrapper's calling convention -- not the
generic SVC trampoline's.

## What `__syscall_cancel` actually does (musl)

Static-musl's `__syscall_cancel(a, b, c, d, e, f, nr)` (see
`musl/src/thread/__syscall_cancel.c`):

1. Loads `self->cancel` (the calling thread's pthread cancel state).
2. If a cancellation has been requested AND the thread is in
   `PTHREAD_CANCEL_ENABLE` AND the cancel type is asynchronous, it
   raises the cancel and never returns from the call.
3. Otherwise it issues the syscall (the SVC that the SVC-site
   rewriter currently patches) and returns the raw kernel result
   (`-4095..-1` encodes errno; everything else is the return value).
4. After the SVC, it re-checks the cancel state for deferred
   cancellation and may run the cancel handlers if a cancel arrived
   during the syscall.
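The steps above can be condensed into a hedged C sketch. The thread-state fields and the `do_syscall` stand-in are illustrative assumptions, not musl's actual internals, and step 2's "never returns" is modeled here as an early error return:

```c
#include <assert.h>

/* Illustrative stand-ins: field names are assumptions, not musl's
 * actual thread-state layout; do_syscall models the raw SVC. */
struct self_state { int cancel; int canceldisable; int cancelasync; };

static long do_syscall(long nr, long a) { (void)nr; return a; }

/* Hedged sketch of __syscall_cancel's control flow (steps 1-4 above),
 * with the async-cancel path modeled as returning -EINTR (-4). */
static long syscall_cancel_sketch(struct self_state *self, long nr, long a)
{
    /* steps 1-2: pending cancel + enabled + async => act on the cancel */
    if (self->cancel && !self->canceldisable && self->cancelasync)
        return -4;                 /* the real code never returns here */

    long r = do_syscall(nr, a);    /* step 3: raw kernel result */

    /* step 4: the deferred-cancel re-check would run here */
    return r;
}
```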
The wrapper's calling convention on aarch64:

| Reg       | Meaning                          |
|-----------|----------------------------------|
| `x0..x5`  | syscall args `a..f`              |
| `x6`      | syscall number `nr`              |
| `x0`      | return value (raw kernel result) |
| `x7..x18` | caller-clobbered scratch         |
| `x19+`    | callee-saved (preserved)         |

The key facts for kbox:

- errno is *not* set inside `__syscall_cancel`; the libc wrapper
  (e.g. `open()`) inspects the return and sets errno.
- The result register (`x0`) and the dispatch register (`x6`) are
  exactly what `kbox_syscall_rewrite_aarch64_dispatch` already
  consumes via the existing cancel trampoline
  (`kbox_syscall_rewrite_aarch64_cancel_entry` in `src/rewrite.c`).
- All FD bookkeeping happens inside the dispatch path; the cancel
  fast path is just an alternate entry to the same dispatch and
  inherits the existing forwarder semantics unchanged.
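The first bullet's raw-result convention can be illustrated with a small C sketch in the style of musl's `__syscall_ret` (the function name here is ours; only the `-4095..-1` encoding is from this doc):

```c
#include <errno.h>

/* Sketch of the wrapper-side errno conversion: __syscall_cancel
 * returns the raw kernel result, and the libc wrapper maps the
 * -4095..-1 error band onto errno / -1. musl does this in
 * __syscall_ret; this stand-alone version just mimics the idea. */
static long raw_to_errno_sketch(unsigned long r)
{
    if (r > -4096UL) {   /* r is in [-4095, -1] viewed as signed */
        errno = (int)-r;
        return -1;
    }
    return (long)r;
}
```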
## What we lose

Cancellation-point semantics. By bypassing `__syscall_cancel`, the
fast path skips the pre- and post-syscall pthread cancel checks. For
a single-threaded program no cancellation can ever be pending, so
this is a no-op. For a multi-threaded program with a thread that has
been `pthread_cancel`'d, the bypassed `open()` will not act as a
cancellation point and will not raise the cancel until the next real
cancellation point.

This is the only correctness gap. errno propagation, FD bookkeeping,
register preservation, and unwind metadata are all unaffected (we
return to `bl_pc + 4`, exactly where the BL would have returned, with
`x0` set the same way).

## Gating policy

Two conditions must both hold at install time for a cancel-style BL
site to be promoted. The gate is stored in
`kbox_rewrite_runtime::cancel_promote_allowed`, computed once during
`kbox_rewrite_runtime_install()` and consulted from
`rewrite_runtime_should_patch_site()`.

### Condition 1: the binary is static

`launch->interp_elf == NULL && launch->interp_elf_len == 0`. For a
dynamic binary, libc (and therefore the clone wrapper that
`pthread_create` depends on) lives in an interpreter-loaded DSO that
the main-ELF scan cannot see. A dynamic program could also `dlopen`
a DSO that spins up threads at runtime, which is not detectable
statically at all. Promotion for dynamic binaries is unsafe, so the
gate rejects them outright. This also means the cancel-wrapper fast
path only benefits static-musl binaries today; dynamic programs stay
on the existing forwarder path.

### Condition 2: main_elf has no fork-family wrapper sites

`kbox_rewrite_has_wrapper_syscalls(main_elf, ..., {clone, fork,
vfork, clone3})` returns 0. Because the binary is static (condition
1), libc is *part of* `main_elf` -- there is no separate interpreter
ELF to scan. Any `pthread_create` -> libc clone wrapper compiles
down to a `mov x8, #220; svc 0` site inside `main_elf`'s text
segment, and the wrapper-number scanner catches it. Scanning
`main_elf` alone is therefore sufficient to cover the embedded libc
in a static build; this is the important invariant that makes the
gate sound.

Rationale: no fork-family sites in the main (= only) ELF implies
the program cannot create additional threads, which implies pthread
cancellation cannot be pending on any thread, which makes the
cancel bypass a strict no-op.
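Putting the two conditions together, the gate boils down to something like the following sketch. The struct layout and helper shape are assumptions; the real computation lives in `kbox_rewrite_runtime_install()`:

```c
#include <stddef.h>

/* Assumed shape of the launch descriptor; only the two fields the
 * gate reads are shown here. */
struct launch_sketch {
    const void *interp_elf;
    size_t interp_elf_len;
};

/* Condition 1 (static binary) AND condition 2 (no fork-family
 * wrapper sites). `fork_sites` stands in for the result of
 * kbox_rewrite_has_wrapper_syscalls(main_elf, ..., {clone, fork,
 * vfork, clone3}). */
static int cancel_promote_allowed(const struct launch_sketch *l,
                                  int fork_sites)
{
    int is_static = l->interp_elf == NULL && l->interp_elf_len == 0;
    return is_static && !fork_sites;
}
```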
### Conservative by design

This gate rejects multi-threaded programs that never actually call
`pthread_cancel` -- those would also be safe to promote, but proving
it statically is unreliable. The gate costs nothing on the
`bench-test` target (single-threaded) and trivially preserves
correctness for everything else by leaving the existing forwarder
path in place.

### Known residual limitations

- A static program that invokes `clone`/`clone3` via the `syscall()`
  library function (register-indirect `mov x8, x_reg; svc 0`) slips
  past the wrapper-number scanner, which only matches the
  literal-immediate pattern `movz x8, #nr; svc 0`. The gate would
  approve such a binary even though it can actually create threads.
  This pattern is very rare in practice -- no test binary in the
  tree exercises it, and musl's own `pthread_create` path uses the
  immediate form -- but it is a known unsoundness that would need a
  stronger static analysis to close. It was not introduced by this
  series; the underlying scanner predates it.

- The shared-libc musl `__syscall_cancel` calling convention
  differs from the static one (`nr` in `x0` vs `x6`). Even if
  condition 1 were relaxed, the current BL-site detector would
  not recognize dynamic-musl call sites. Out of scope for this
  fast path.

## Site detection

Pattern (aarch64, walk the segment one 4-byte instruction at a time):

```
mov{z} x6, #imm16   ; syscall number into x6
... within 32 bytes ...
bl <target>         ; opcode 0x94XXXXXX
```

The intermediate instructions are arbitrary (typically arg setup and
maybe a `mov x?, x6`). The 32-byte horizon is a safety bound to keep
the heuristic local; in practice the BL is one or two instructions
after the `mov`.
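A hedged sketch of the scan follows. The instruction encodings are the architectural ones (`movz Xd, #imm16` is `0xD2800000 | imm16 << 5 | Rd`; `bl` is `0x94000000 | imm26`), but the function name and return shape are ours, not the real scanner's:

```c
#include <stdint.h>
#include <stddef.h>

/* Scan a text segment (as 4-byte words) for the pattern above:
 * `movz x6, #imm16` followed within 32 bytes (8 instructions) by a
 * `bl`. On success, writes the BL's instruction index to *bl_index
 * and returns 1. */
static int find_cancel_bl_site(const uint32_t *text, size_t n_insns,
                               size_t *bl_index)
{
    for (size_t i = 0; i < n_insns; i++) {
        if ((text[i] & 0xFFE0001Fu) != 0xD2800006u) /* movz x6, #imm16 */
            continue;
        /* look for a BL within the 32-byte horizon */
        size_t horizon = i + 8 < n_insns ? i + 8 : n_insns - 1;
        for (size_t j = i + 1; j <= horizon; j++) {
            if ((text[j] & 0xFC000000u) == 0x94000000u) { /* bl */
                *bl_index = j;
                return 1;
            }
        }
    }
    return 0;
}
```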
We deliberately do *not* match plain `b` (opcode `0x14XXXXXX`).
A plain `b` would be a tail call, which has different return
semantics: after the rewrite trampoline executes, control falls
through to `b_pc + 4`, but a tail-call site has no meaningful "next
instruction" -- the surrounding function expects never to come back.
Restricting to BL sidesteps this entire class of misanalysis.

We also do not validate that the BL target is the actual
`__syscall_cancel` symbol. Doing so would require a dynsym/symtab
walk and would only work for symbol-bearing static-musl binaries.
The `mov x6, #nr` constraint is already a strong structural filter
(`x6` is not used as an argument register by the kernel syscall ABI
on aarch64), and the single-thread gate makes any false positive a
no-op anyway.

## Patch and trampoline

The patch is the same B-relative-to-trampoline encoding the rewriter
already uses for SVC sites:

```
[bl <target>] -> [b <trampoline_slot>]
```
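The replacement instruction takes a few lines of C to encode (a sketch; the real encoder is `kbox_rewrite_encode_patch`, whose exact signature this does not claim to match). An unconditional B is `0x14000000 | imm26`, where `imm26` is the signed word offset from the site to the slot:

```c
#include <stdint.h>

/* Encode `b <slot>` to be written over the `bl` at bl_pc. The
 * offset is counted in 4-byte words and must fit in signed 26 bits,
 * i.e. the +-128 MiB branch range. */
static uint32_t encode_b_to_slot(uint64_t bl_pc, uint64_t slot)
{
    int64_t word_off = (int64_t)(slot - bl_pc) / 4;
    return 0x14000000u | ((uint32_t)word_off & 0x03FFFFFFu);
}
```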
The trampoline slot (`AARCH64_REWRITE_SLOT_SIZE` = 32 bytes) is
emitted by `write_aarch64_trampoline` with
`wrapper_kind = SYSCALL_CANCEL`, which causes it to point at
`kbox_syscall_rewrite_aarch64_cancel_entry` (versus the regular
`kbox_syscall_rewrite_aarch64_entry` for SVC sites). The cancel
entry differs from the regular entry in exactly one line: it loads
`nr` from saved `x6` (offset `+40`) instead of saved `x8` (offset
`+56`). Everything else -- register save/restore, the call into
`kbox_syscall_rewrite_aarch64_dispatch`, and the resume sequence --
is identical.

After dispatch, the cancel entry executes:

```
add x16, x19, #4 ; x19 holds origin = bl_pc, so x16 = bl_pc + 4
br  x16
```

This resumes the wrapper at the instruction after the BL, with `x0`
holding the kernel result, which is exactly the state the wrapper
expects after a normal return from `__syscall_cancel`.

`x30` (LR) is restored to whatever the BL site's caller had stored
before the call. We do *not* update it to `bl_pc + 4` even though a
real BL would have. This is fine: AAPCS64 treats `x30` as
caller-clobbered across a call, and the wrapper's prologue has
already saved the function's own return address; nothing in the
wrapper body reads `x30` between the BL and the function epilogue.

If the BL is more than ±128 MiB from the trampoline page, the
existing veneer fallback in `kbox_rewrite_runtime_install` bridges
the gap exactly the same way it does for out-of-range SVC sites.
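The range check behind that fallback decision can be stated precisely (a sketch; the helper name is ours): `imm26` covers signed word offsets in `[-2^25, 2^25)`, i.e. byte offsets in `[-128 MiB, +128 MiB)`:

```c
#include <stdint.h>

/* True iff `target` is directly reachable from a B/BL at `pc`,
 * i.e. the byte offset fits the signed 26-bit word immediate. */
static int b_reachable(uint64_t pc, uint64_t target)
{
    int64_t off = (int64_t)(target - pc);
    return off >= -(INT64_C(1) << 27) && off < (INT64_C(1) << 27);
}
```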
## Tests

Unit (`tests/unit/test-rewrite.c`):

- Encodes a synthetic `mov x6, #57; bl ...` (close) and a
  `mov x6, #56; bl ...` (openat) inside an aarch64 ELF segment;
  asserts `analyze_segment` emits a planned site at the BL with
  `width=4` and `original` matching the BL bytes.
- Asserts `kbox_rewrite_encode_patch` for the BL site emits a
  4-byte B with `imm26` pointing at the trampoline.
- Asserts that with `kbox_rewrite_has_fork_sites` true on the same
  ELF, `cancel_promote_allowed` would be 0 and the install path
  would skip the cancel-kind sites (validated through
  `rewrite_runtime_should_patch_site`).

Integration: bench-test under `--syscall-mode=rewrite` on lima
(correctness) and on `arm` (perf delta on `open+close`).

## Performance baseline

Before (real Arm64, `bench-test 1000`, BUILD=release, from `TODO.md`):

| Syscall    | rewrite |
|------------|---------|
| open+close | 47.3us  |

Target: pull `open+close` into the same ~1.5 us tier as the other
rewritten rows. The numbers will be re-captured on the same `arm`
host with the same release build before/after this change and
recorded in the changelog.