I used the new CPU counter mode in Instruments.app to track down functions that had instruction delivery bottlenecks (indicating i-cache misses) and picked a bunch of trivial functions to mark as inline (plus a couple that are only used once or twice and which benefit from inlining).
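As a rough illustration of the kind of change described above, here is a minimal sketch of marking a trivial function `inline` in Zig. The `Point` type and its `index` method are hypothetical and are not taken from this branch. Note that in Zig, `inline fn` is not just a hint: the function is inlined at every call site, which is why it suits only small or rarely called functions.

```zig
const std = @import("std");

// Hypothetical example type for illustration; not part of the actual change.
const Point = struct {
    x: u16,
    y: u16,

    // `inline fn` makes the compiler inline this function at every call
    // site, so no separate call is emitted for this trivial computation.
    pub inline fn index(self: Point, cols: u16) usize {
        // Widen to usize before multiplying to avoid u16 overflow.
        return @as(usize, self.y) * cols + self.x;
    }
};

pub fn main() void {
    const p = Point{ .x = 3, .y = 2 };
    // Row-major offset of the point in a 130-column grid.
    std.debug.print("{d}\n", .{p.index(130)});
}
```

The trade-off is code size: every call site gets its own copy of the body, so inlining larger functions that are called from many places can increase i-cache pressure rather than reduce it.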
The size of `macos-arm64/libghostty-fat.a` built with `zig build -Doptimize=ReleaseFast -Dxcframework-target=native` goes from 145,538,856 bytes on `main` to 145,595,952 bytes on this branch, a negligible increase of 57,096 bytes (about 0.04%).

These changes resulted in some pretty sizable improvements in vtebench results on my machine (Apple M3 Max):

With this, the only vtebench test we're slower than Alacritty in (on my machine, at 130x51 window size) is `dense_cells` (which, IMO, is so artificial that optimizing for it might actually negatively impact real world performance).

I also did a pretty simple improvement to how we copy the screen in the renderer, giving it its own page pool for less memory churn. Further optimization in that area should be explored since in some scenarios it seems like as much as 35% of the time on the `io-reader` thread is spent waiting for the lock.

Note
Before this is merged, someone really ought to test this on an x86 processor to see how the performance compares there, since this is tuned for my processor specifically, and I know that M chips have a pretty big i-cache compared to some x86 processors, which could impact the performance characteristics of these changes.