Skip to content

Conversation

qwerasd205
Copy link
Member

I used the new CPU counter mode in Instruments.app to track down functions that had instruction delivery bottlenecks (indicating i-cache misses) and picked a bunch of trivial functions to mark as inline (plus a couple that are only used once or twice and which benefit from inlining).

The size of macos-arm64/libghostty-fat.a built with zig build -Doptimize=ReleaseFast -Dxcframework-target=native goes from 145,538,856 bytes on main to 145,595,952 on this branch, a negligible increase.

These changes resulted in some pretty sizable improvements in vtebench results on my machine (Apple M3 Max):
image

With this, the only vtebench test we're slower than Alacritty in (on my machine, at 130x51 window size) is dense_cells (which, IMO, is so artificial that optimizing for it might actually negatively impact real world performance).

I also did a pretty simple improvement to how we copy the screen in the renderer, gave it its own page pool for less memory churn. Further optimization in that area should be explored since in some scenarios it seems like as much as 35% of the time on the io-reader thread is spent waiting for the lock.

Note

Before this is merged, someone really ought to test this on an x86 processor to see how the performance compares there, since this is tuning for my processor specifically, and I know that M chips have pretty big i-cache compared to some x86 processors which could impact the performance characteristics of these changes.

Changes it so that the renderer retains its own MemoryPool for PageList
pages so that new pages rarely need to be allocated when cloning the
screen. Also switches to using an arena allocator in `updateFrame` to
avoid having to deinit the cloned screen since instead we can just throw
out the memory.
Supported by benchmarks (vtebench on Apple M3 Max)
Supported by benchmarks (vtebench on Apple M3 Max)
A whole bunch of inline annotations, some of these were tracked down
with Instruments.app, others are guesses / just seemed right because
they were trivial wrapper functions.

Regardless, these changes are ultimately supported by improved vtebench
results on my machine (Apple M3 Max).
@qwerasd205 qwerasd205 requested review from a team as code owners September 29, 2025 03:47
@qwerasd205
Copy link
Member Author

qwerasd205 commented Sep 29, 2025

Oh, test failures. I totally neglected to test before PRing. I'll look in to that tomorrow.

Edit: They're very confusing errors actually-- is it some sort of miscompile? It's seemingly caused by Parser.doAction being inline and then Parser.next being called with a comptime-known 0x1B...

Very weird failures, not 100% sure of the cause; regardless, this fixes
them.
@qwerasd205 qwerasd205 force-pushed the inline-all-the-things branch from 7f6737b to a06d32b Compare September 29, 2025 17:18
@mitchellh
Copy link
Contributor

I can run the vtebench marks at home when I have my x86 machine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants