perf(tarball): tee integrity hash on the download stream#278
Conversation
Pull request overview
This PR optimizes the tarball download/extract pipeline by streaming the HTTP response through gzip while concurrently hashing, and by hashing each tar entry as it’s read to eliminate redundant passes over bytes. It also updates agent guidelines and stabilizes a couple of XDG-related npmrc tests under developer environments.
Changes:
- Stream tarball HTTP bytes through an `IntegrityChecker` + `async_compression` gzip decode to avoid buffering + double-hashing compressed bytes.
- Interleave per-entry SHA-512 while reading tar entries and add `StoreDir::write_cas_file_prehashed` to avoid re-hashing full buffers.
- Update docs (AGENTS.md) and adjust npmrc tests to avoid `PNPM_HOME` shadowing XDG behavior.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| crates/tarball/src/lib.rs | Replace buffer-then-inflate with streaming gzip decode + integrity tee; hash tar entries while reading |
| crates/tarball/Cargo.toml | Add deps needed for streaming decode and hashing (async-compression, futures-util, sha2) |
| crates/store-dir/src/cas_file.rs | Add write_cas_file_prehashed and refactor write_cas_file to delegate |
| crates/npmrc/src/lib.rs | Clear PNPM_HOME in XDG test to avoid ambient env shadowing |
| crates/npmrc/src/custom_deserializer.rs | Clear PNPM_HOME in XDG test to avoid ambient env shadowing |
| Cargo.toml | Add async-compression, enable reqwest stream feature, remove zune-inflate |
| Cargo.lock | Lockfile updates for new/removed deps |
| AGENTS.md | Document “internal performance divergence is allowed” guidance |
Codecov Report
❌ Patch coverage is …
Additional details and impacted files
@@ Coverage Diff @@
## main #278 +/- ##
==========================================
- Coverage 89.67% 89.28% -0.40%
==========================================
Files 61 61
Lines 5074 5327 +253
==========================================
+ Hits 4550 4756 +206
- Misses 524 571 +47
Micro-Benchmark Results (Linux)
Three comments, all actionable:

* `write_cas_file_prehashed` silently corrupts the CAFS if a caller passes a hash that isn't `Sha512(buffer)`: the blob lands at a path whose filename advertises a digest its contents don't satisfy. Release builds trust the caller so the single-pass hot path stays fast; `debug_assert_eq!(Sha512::digest(buffer), file_hash)` catches the misuse in tests and debug builds.
* The chunked read loop inside `extract_tarball_entries` extended `buffer` via `extend_from_slice`, which falls back to `Vec::reserve` and aborts on allocation failure. The initial `try_reserve(prealloc_hint)` was the whole point of the OOM-safety guard: the hint is clamped to 64 MiB even when the tar header claims more, so an entry whose real size exceeds the clamp MUST keep allocating without aborting. Call `buffer.try_reserve(n)` before each extend and propagate failure as `TarballError::ReadTarballEntries` (matching the prealloc error shape).
* The streaming download bailed on the first `decoder.write_all` error, returning `DecodeGzip` for a corrupt-gzip-with-mismatched-integrity tarball. pnpm's upstream pipeline (`worker/src/start.ts` → `crypto.hash` then `gunzipSync`) hashes the compressed bytes before touching zlib, so the same input surfaces as an integrity failure there, and error codes are part of the public contract per AGENTS.md "errors and diagnostics". Keep feeding the `IntegrityChecker` with every chunk regardless of decoder state, then run `checker.result()` first. Only after integrity passes do we surface a stored decode error (or call `decoder.shutdown` to finalize a cut-short footer).
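The fallible-reserve loop the second comment asks for can be sketched with std types only. This is a minimal sketch, not the PR's actual code: `TarballError` is reduced to a one-variant stand-in, and the real loop also feeds a SHA-512 hasher per chunk.

```rust
use std::io::{self, Read};

// Stand-in for the real error enum; the actual variant carries the
// same io::Error shape as the prealloc failure path.
#[derive(Debug)]
enum TarballError {
    ReadTarballEntries(io::Error),
}

fn oom_err(e: std::collections::TryReserveError) -> TarballError {
    TarballError::ReadTarballEntries(io::Error::new(io::ErrorKind::OutOfMemory, e))
}

/// Read an entry to EOF in 64 KiB chunks, reserving fallibly before
/// every append so an entry larger than the clamped hint degrades
/// into an error instead of an allocator abort.
fn read_entry_fallibly<R: Read>(
    mut entry: R,
    prealloc_hint: usize, // assumed already clamped to 64 MiB
) -> Result<Vec<u8>, TarballError> {
    let mut buffer = Vec::new();
    buffer.try_reserve(prealloc_hint).map_err(oom_err)?;
    let mut chunk = [0u8; 64 * 1024];
    loop {
        let n = entry.read(&mut chunk).map_err(TarballError::ReadTarballEntries)?;
        if n == 0 {
            break;
        }
        // Fallible reserve *before* extend_from_slice, so growth past
        // the hint never takes the infallible abort-on-OOM path.
        buffer.try_reserve(n).map_err(oom_err)?;
        buffer.extend_from_slice(&chunk[..n]);
    }
    Ok(buffer)
}
```

An entry whose real size exceeds the hint keeps allocating through `try_reserve`, so the only failure mode is a propagated error, never an abort.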
Pull request overview
Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.
Copilot follow-up review on #278. `package_unpacked_size` comes from registry metadata (untrusted) and flowed straight into `Vec::with_capacity(size)`. A malicious or corrupt `unpackedSize` of `usize::MAX` would turn the prealloc into an immediate abort before the first byte of the tarball even arrived, with no surfaceable I/O error. Treat the hint as best-effort: clamp it to the same 64 MiB ceiling used for `MAX_ENTRY_PREALLOC_BYTES` and reserve fallibly with `try_reserve`, dropping the reservation on allocation failure so the `Vec` simply falls back to geometric growth. Tarballs whose real decompressed size exceeds the clamp still work; they just don't get the prealloc benefit. Note: real-world decompressed tarballs are under 64 MiB for >99% of npm packages (typescript at ~60 MiB sits near the top). The per-tarball peak memory under concurrency (post-download-semaphore cap × 64 MiB) already bounds our in-flight RSS; this cap keeps that bound from inflating on a bad manifest.
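The clamp-then-fallibly-reserve idea can be sketched in a few lines. A minimal sketch under assumed names: `MAX_PREALLOC_BYTES` mirrors the 64 MiB ceiling discussed above, and `preallocate_output` is a hypothetical helper, not the PR's actual function.

```rust
/// Ceiling shared with the per-entry prealloc guard (64 MiB, the
/// value from the review discussion).
const MAX_PREALLOC_BYTES: usize = 64 * 1024 * 1024;

/// Treat an untrusted `unpackedSize` from registry metadata as a
/// best-effort hint: clamp it, then reserve fallibly. On allocation
/// failure the reservation is dropped and the Vec falls back to
/// ordinary geometric growth.
fn preallocate_output(unpacked_size_hint: Option<usize>) -> Vec<u8> {
    let mut buffer = Vec::new();
    if let Some(hint) = unpacked_size_hint {
        let clamped = hint.min(MAX_PREALLOC_BYTES);
        // try_reserve instead of with_capacity: a hostile usize::MAX
        // hint must not abort the process before any I/O error can
        // surface, so the result is deliberately ignored.
        let _ = buffer.try_reserve(clamped);
    }
    buffer
}
```

Even `Some(usize::MAX)` produces at most a 64 MiB reservation attempt, and a failed attempt leaves an empty, fully usable `Vec`.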
Pull request overview
Copilot reviewed 5 out of 6 changed files in this pull request and generated 4 comments.
Integrated-Benchmark Report (Linux)
Scenario: Frozen Lockfile
BENCHMARK_REPORT.json
{
"results": [
{
"command": "pacquet@HEAD",
"mean": 3.0612074172,
"stddev": 0.07254709458376475,
"median": 3.0468951139,
"user": 2.46729384,
"system": 3.4732257399999993,
"min": 2.9610352284,
"max": 3.1894596804,
"times": [
3.1014075294,
2.9840049164,
3.1237359844,
3.1894596804,
3.0590186764,
3.0347715514,
3.0074028484,
3.1275060804,
3.0237316764,
2.9610352284
],
"exit_codes": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
]
},
{
"command": "pacquet@main",
"mean": 2.9799968133,
"stddev": 0.08831080976453795,
"median": 2.9780374749,
"user": 2.47132464,
"system": 3.43976314,
"min": 2.8299090384,
"max": 3.1256024644,
"times": [
3.0483820684,
2.9789185614,
2.9307676064,
2.9187984764,
2.9771563884,
3.0882474604,
3.1256024644,
2.9843966514,
2.8299090384,
2.9177894174
],
"exit_codes": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
]
},
{
"command": "pnpm",
"mean": 6.7077239796,
"stddev": 0.07541291153144425,
"median": 6.6887728799,
"user": 9.28145574,
"system": 4.581847639999999,
"min": 6.6099420884,
"max": 6.8111099794,
"times": [
6.6818056094000005,
6.664050806400001,
6.6519270294,
6.695740150400001,
6.809544804400001,
6.7139850754,
6.634682557400001,
6.8111099794,
6.6099420884,
6.8044516954
],
"exit_codes": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
]
}
]
}
Four comments, all actionable:
* `write_cas_file` now routes through a private
`write_cas_file_unchecked` instead of `write_cas_file_prehashed`.
The only change for the pnpm-facing API is internal: before this
commit, every `write_cas_file` call on a debug/test build would
re-hash its own buffer inside the `debug_assert_eq!` of
`write_cas_file_prehashed`. That's cheap per-call but on an install
hot path across a whole dep tree it showed up as measurable debug-
build overhead. The unchecked helper skips the assert; the public
`write_cas_file_prehashed` keeps it for the external-caller-with-
its-own-hash case.
* Add two unit tests for `write_cas_file_prehashed`:
- `write_cas_file_prehashed_parity_with_write_cas_file`: same bytes
through both entry points land at the same CAFS path and produce
the same on-disk contents, for both `executable=false` and
`executable=true`.
- `write_cas_file_prehashed_debug_asserts_hash_matches_buffer`:
`#[should_panic]` locks in the foot-gun guard; gated on
`debug_assertions` because the assert is compiled out in release.
* Replace the stale TODO above `package_integrity.to_string()`.
The old comment talked about "Cloning here is less than desirable"
and suggested `Arc::clone`, but after the streaming refactor this
site no longer clones the `Integrity`; it serializes it to a
`String` for the `index.db` key. The new comment reflects what the
allocation is actually for and what would cut it (an
`Arc<str>` field on `DownloadTarballToStore`) if a profile ever
demanded it.
* Make tokio's `io-util` feature explicit at the workspace level.
`AsyncWriteExt` / `AsyncReadExt` are feature-gated; we were
getting them via whatever transitive crate happened to enable
them (`reqwest`'s `stream` feature, most likely). That's fragile
across dep-graph changes. Now declared as a first-class feature
on the workspace tokio line alongside `rt`, `rt-multi-thread`,
and `macros`.
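The split between the unchecked hot path and the guarded public entry point can be sketched with std types only. This is a shape sketch, not the real `StoreDir`: the 64-bit `digest` stand-in replaces the `sha2` crate's SHA-512, and the return value stands in for the real CAFS path + write.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Stand-in 64-bit digest; the real code uses Sha512 from `sha2`,
// swapped out here to keep the sketch dependency-free.
fn digest(buffer: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(buffer);
    h.finish()
}

struct StoreDir;

impl StoreDir {
    /// Internal hot path: trusts `file_hash` unconditionally.
    fn write_cas_file_unchecked(&self, buffer: &[u8], file_hash: u64) -> (u64, usize) {
        // Real code derives the CAFS path from the hash and writes the blob.
        (file_hash, buffer.len())
    }

    /// Public prehashed entry point: keeps the foot-gun guard for
    /// external callers that bring their own hash.
    fn write_cas_file_prehashed(&self, buffer: &[u8], file_hash: u64) -> (u64, usize) {
        debug_assert_eq!(
            digest(buffer),
            file_hash,
            "write_cas_file_prehashed called with a hash that is not digest(buffer)"
        );
        self.write_cas_file_unchecked(buffer, file_hash)
    }

    /// One-shot path hashes its own input, then delegates to the
    /// unchecked helper so debug builds don't re-hash the same bytes.
    fn write_cas_file(&self, buffer: &[u8]) -> (u64, usize) {
        let file_hash = digest(buffer);
        self.write_cas_file_unchecked(buffer, file_hash)
    }
}
```

The design point: `write_cas_file` just computed the hash itself, so routing it through the asserting entry point would only re-verify its own work on every debug-build call.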
Pull request overview
Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.
// the compressed bytes into an SHA-512 `IntegrityChecker`.
// Compared to the old "buffer-everything, hash, decompress"
// pipeline this removes one full pass over the compressed body
// (no separate `Integrity::check(&response)` call), never holds
// both the compressed and decompressed buffers at the same
// time, and lets decompression overlap with `await`s on the
// network. The blocking tar iterator still runs inside
// `spawn_blocking` below so per-file SHA-512 + CAFS writes
// stay off the reactor.
There was a problem hiding this comment.
Grammar/accuracy nit in the comment: “teeing the compressed bytes into an SHA-512 IntegrityChecker” should be “a …”, and IntegrityChecker isn’t inherently SHA-512 (it follows the algorithms present in Integrity). Consider rewording to avoid implying a fixed hash algorithm.
Suggested change:

// the compressed bytes into an `IntegrityChecker` for the
// expected integrity algorithms. Compared to the old
// "buffer-everything, hash, decompress" pipeline this removes
// one full pass over the compressed body (no separate
// `Integrity::check(&response)` call), never holds both the
// compressed and decompressed buffers at the same time, and
// lets decompression overlap with `await`s on the network. The
// blocking tar iterator still runs inside `spawn_blocking`
// below so per-file SHA-512 + CAFS writes stay off the reactor.
/// Same as [`write_cas_file`](Self::write_cas_file) but takes a
/// pre-computed SHA-512 so the caller can hash the bytes as they
/// arrive (e.g. chunk-by-chunk out of a tar entry) and skip the
/// separate `Sha512::digest(buffer)` pass inside this function.
///
/// The caller is responsible for ensuring `file_hash` is the
/// SHA-512 of `buffer`; passing a mismatched hash would corrupt
/// the CAFS (the blob would land at a path that advertises a
/// digest its contents don't satisfy). A `debug_assert` catches
/// that misuse in debug builds / tests; release builds trust the
/// caller so the write path stays single-pass.
pub fn write_cas_file_prehashed(
    &self,
    buffer: &[u8],
    file_hash: FileHash,
    executable: bool,
) -> Result<PathBuf, WriteCasFileError> {
    debug_assert_eq!(
        Sha512::digest(buffer),
        file_hash,
        "write_cas_file_prehashed called with a hash that is not Sha512(buffer); \
         passing a mismatched hash would silently corrupt the CAFS",
    );
    self.write_cas_file_unchecked(buffer, file_hash, executable)
write_cas_file_prehashed is safe to call but relies on the caller to provide the correct hash (the guard is only a debug_assert, so release builds will trust the input). Since a mismatch would place bytes under the wrong digest path, consider making the “trusted/unchecked” nature more explicit in the API (e.g., naming) to reduce accidental misuse by other internal callers.
Update: the streaming pipeline has been reverted in bab8b40. The PR now keeps only the two changes that unambiguously help.

What's left in the PR

Compressed bytes are still buffered fully before decompression.

Why I dropped the async gzip pipeline

Toxiproxy-injected latency runs (integrated-benchmark frozen-lockfile, 1352-snapshot fixture, local M3) showed the async-compression version losing at every latency tier: it forced the gzip backend to `flate2`/`miniz_oxide`. The narrow revert recovers parity with main on the 50 ms run (within noise) while keeping the two cheap wins. A block comment at the streaming site documents these numbers so the next person who reaches for `async-compression` doesn't have to re-derive the regression.

Follow-up

A separate small PR will teach …
Clarify the scope of the cardinal "match pnpm exactly" rule: it governs observable behavior (CLI flags, defaults, error codes, `.npmrc` semantics, lockfile format, CAS/store-index layout) and not internal implementation detail. Pipeline topology, async vs. sync, streaming vs. buffering, hash-tee vs. post-hoc hashing, thread-pool shape, chunk sizes, and parallelism strategy may diverge when the divergence delivers a measured perf win and leaves observable outputs byte-identical. Require any such divergence to carry a `Why we diverge:` note so it stays auditable and reversible if upstream later adopts a different approach.
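As an illustration of the clause above, a hypothetical `Why we diverge:` note (wording invented here, modeled on this PR's own streaming divergence) might read:

```rust
// Why we diverge: pnpm buffers the whole tarball and hashes it in one
// shot (worker/src/start.ts); we tee the integrity hash while draining
// the response stream instead. Observable outputs (error codes, CAFS
// layout, index.db rows) are byte-identical; measured win: one fewer
// full pass over the compressed body.
```

The note sits next to the diverging code so the divergence stays auditable and easy to revert if upstream later changes its own approach.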
Two perf fixes from investigations/pacquet-macos-perf.md that both collapse redundant passes over the tarball bytes.

1. Stream the HTTP body through the gzip decoder (doc fix #4). Replace `response.bytes().await` + `decompress_gzip` (sync zune-inflate) + separate `Integrity::check(&buffer)` with a streaming pipeline: `bytes_stream()` teed through `ssri::IntegrityChecker` into `async_compression::tokio::write::GzipDecoder`. This removes one full pass over the compressed body (no separate `.check()`), never holds compressed + decompressed buffers at the same time, and lets decompression overlap with network awaits. Integrity is still verified end-to-end before any CAFS blob is written, so a mismatch aborts the install with the same `TarballError::Checksum` behavior as before.

2. Interleave SHA-512 with the per-entry read (doc fix #2). Add `StoreDir::write_cas_file_prehashed`; the tar extraction loop reads + hashes 64 KiB chunks in one pass instead of `read_to_end` followed by a separate `Sha512::digest(buffer)` inside `write_cas_file`. The buffer still has to exist (CAFS filename is hash-derived), but two passes over the decompressed entry bytes collapse into one. Mirrors pnpm's `parseTarball` + `addBufferToCafs` in `store/cafs/src/addFilesFromTarball.ts`.

Why we diverge from pnpm on (1): upstream `fetching/tarball-fetcher/src/remoteTarballFetcher.ts` + `worker/src/start.ts` + `store/cafs/src/addFilesFromTarball.ts` buffers the whole tarball into a Node `SharedArrayBuffer` and runs `crypto.hash` + `zlib.gunzipSync` sequentially. Observable behavior here is identical: same integrity errors on mismatch, same decompressed bytes, same CAFS layout, same `index.db` rows. So per the "Internal performance divergence is allowed" clause added in the preceding commit, the streaming topology is a pacquet-specific implementation-detail optimization rather than a shape port.
Dep changes: swap `zune-inflate` for `async-compression` (gzip + tokio features) + `futures-util`; enable reqwest's `stream` feature; add `sha2` to the tarball crate. Tests: existing `packages_under_orgs_should_work` (real-network end-to-end) and `should_throw_error_on_checksum_mismatch` both pass unchanged, proving the streaming pipeline produces the same decompressed bytes and fails integrity checks the same way.
An earlier commit in this PR routed the response body through
`async_compression::tokio::write::GzipDecoder` so network fetch and
gzip inflation could overlap. That forced the gzip backend to
`flate2`/`miniz_oxide` (the only gzip feature async-compression
offers) and turned out to regress wall-time at every latency tier
I could reproduce locally.
Toxiproxy-injected latency, integrated-benchmark frozen-lockfile
scenario, 1352-snapshot fixture, M3 / APFS:
0 ms (loopback) main 2.60 s vs streaming 2.98 s (15% slower)
50 ms (typical) main 16.70 s vs streaming 18.94 s (13% slower)
200 ms (WAN) main 22.37 s vs streaming 22.58 s (break-even)
The overlap gain does scale with latency, but it never overtakes
the decompressor regression at any realistic latency — `miniz_oxide`
is slow enough on the decompress side that matching pnpm upstream's
"buffer body, then `gunzipSync`" topology wins on wall-time with the
current decompressor set.
Revert to:
* `bytes_stream()` drained chunk-by-chunk so we can tee each chunk
through the `IntegrityChecker` — this is the one piece of the
streaming pipeline that pays for itself. Saves one full pass over
the compressed body vs the pre-PR `response.bytes().await` +
`package_integrity.check(&response)`.
* Buffer the compressed bytes fully, then decompress with
`zune-inflate` sync inside `spawn_blocking`, exactly as the
pre-PR code did.
* Integrity is verified before we touch gzip, matching pnpm's
`crypto.hash` → `gunzipSync` ordering.
Keeps the per-entry hash interleave from the earlier commit (that
one is unambiguously a win and scales with file count, not latency).
Removes `async-compression` (workspace + crate dep) and tokio's
`io-util` feature (no other caller in the workspace); restores
`zune-inflate` as the decompressor.
Code change includes a block comment at the streaming site that
cites these numbers so the next person who reaches for
async-compression doesn't have to re-derive the regression.
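The drain-everything-then-check ordering described above can be sketched std-only. This is a shape sketch under loud stand-ins: the running `u64` sum replaces the real `ssri::IntegrityChecker`, and the `decode_fails` flag replaces the gzip decoder erroring mid-stream.

```rust
#[derive(Debug, PartialEq)]
enum TarballError {
    Checksum,
    DecodeGzip,
}

/// Drain the download chunk-by-chunk: every chunk feeds the integrity
/// state even after the decoder has failed, and the integrity verdict
/// is consulted before any stored decode error is surfaced.
fn verify_then_decode(
    chunks: &[&[u8]],
    expected_sum: u64,
    decode_fails: bool,
) -> Result<Vec<u8>, TarballError> {
    let mut sum: u64 = 0; // stand-in for the IntegrityChecker state
    let mut decoded = Vec::new();
    let mut decode_error = None;
    for chunk in chunks {
        sum = chunk.iter().fold(sum, |s, &b| s.wrapping_add(b as u64));
        if decode_error.is_none() {
            if decode_fails {
                // Store the error but keep draining: later chunks must
                // still reach the integrity state.
                decode_error = Some(TarballError::DecodeGzip);
            } else {
                decoded.extend_from_slice(chunk);
            }
        }
    }
    // Integrity verdict first: a corrupt-gzip-with-bad-integrity input
    // surfaces as Checksum, matching pnpm's hash-before-gunzip order.
    if sum != expected_sum {
        return Err(TarballError::Checksum);
    }
    if let Some(e) = decode_error {
        return Err(e);
    }
    Ok(decoded)
}
```

With both faults injected, `Checksum` wins; with only a decode fault, `DecodeGzip` surfaces, preserving the error-code contract.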
Hyperfine runs its measured commands sequentially (no per-iteration interleaving). With `--warmup 1`, the second-listed command's measured runs always start from a state that the first command's 10 measured runs already warmed: OS page cache, verdaccio's on-disk cache, registry-mock in-memory state, inode cache, everything. This shows up as a systematic 1-3% advantage for whichever command is listed second in every PR's benchmark run, regardless of the code difference.

#278's CI surfaced this clearly: three successive rebases all reported main slightly ahead of the PR HEAD (1.03× → 1.15× range depending on how much real signal was in the change), and it replicated even for the current commit where the diff is tiny. The workflow passes `HEAD main`, so `main` is always the second command, and `main` always appears slightly faster.

Fix it by running each benchmarked install once in a `prewarm` phase before hyperfine starts. After prewarm, OS + verdaccio + registry-mock are all fully primed, and hyperfine's `--warmup 1` per command is just the final touch rather than the sole source of state warmup. Both commands' measured runs then start from the same warm state, and any remaining difference reflects the code under test.

Cost: one extra install per revision (plus pnpm if `--with-pnpm`). At ~2.6s each on CI, that's ≤10s of added bench time for a three-way compare; cheap for reliable numbers.

Refactors `cleanup_command()` out of `benchmark()` so `prewarm()` and `benchmark()` share the exact same rm-rf between iterations, keeping the state machine symmetric.
Temporary diagnostic for #278. The last two CI bench runs showed `main` ~3-4% ahead of `HEAD`, but the prewarm pass from the previous commit should have eliminated cold-cache ordering bias — yet the ratio was essentially unchanged. That leaves two competing hypotheses:

* systematic benchmark-framework bias favouring the second-listed hyperfine command (CPU scheduling, measurement overhead, runner variance), or
* a real ~3-4% regression in the remaining PR changes.

Flip the argument order from `HEAD main` to `main HEAD` so `main` is now the first-listed command. If the next CI run shows `HEAD` winning by a similar margin, it's framework bias and we'll follow up with a proper randomization. If `main` still wins, the PR has real overhead and we revert the per-entry hash interleave. Revert this diff once the signal is unambiguous.
Fix #2 from `investigations/pacquet-macos-perf.md` turned out to regress wall-time at 0ms loopback by ~2% across three successive CI runs on #278, even after controlling for hyperfine-ordering bias (the previous commit swapped `HEAD main` → `main HEAD`; main stayed ahead by 2%, so the bias hypothesis explains only ~half the gap).

Fix #2 is the most likely culprit: it hits every tar entry (tens of thousands across the 1352-snapshot fixture) and replaces an optimized `Vec::read_to_end` + single `Sha512::digest` with a chunked read loop that does per-chunk `try_reserve`, `hasher.update`, and `extend_from_slice`. The streaming per-chunk branches add overhead that the saved second pass doesn't recover at the scale of small npm entries (typical entry body is a few KiB, way below where a second scan of the already-hot cache starts mattering).

Revert to the pre-PR shape on the entry loop:

    let mut buffer = Vec::new();
    buffer.try_reserve(prealloc_hint).map_err(...)?;
    entry.read_to_end(&mut buffer).map_err(...)?;
    let (file_path, file_hash) = store_dir
        .write_cas_file(&buffer, file_is_executable)
        .map_err(...)?;

Remove `write_cas_file_prehashed` and the private `write_cas_file_unchecked` helper from `StoreDir`; nothing calls them now. Also drop the two prehashed-specific tests (`…_parity_with_write_cas_file` and `…_debug_asserts_hash_matches_buffer`) and the tarball crate's `sha2` dep, since the one-shot `write_cas_file` on `pacquet-store-dir` is the only consumer of SHA-512 once more.

What's still left in the PR after this revert:

* compressed-side tee-hash: `bytes_stream()` + per-chunk `IntegrityChecker::input` eliminates the separate `Integrity::check(&buffer)` pass after buffering. Chunk count here is tiny (a handful per tarball) so per-chunk overhead is negligible, and the payoff (skipping one full SHA-512 over the compressed body) scales with tarball size.
* integrity check ordering before decompression, matching pnpm's `crypto.hash` → `gunzipSync` sequence.
* AGENTS.md's "internal perf divergence is allowed" clause. Next CI bench will tell us whether the remaining tee-hash alone is clean or whether the regression persists and the PR should close.
Reverts 2c1e83b. That commit flipped the argument order as a one-off probe to disambiguate hyperfine ordering bias from real regression on #278 — it served its diagnostic purpose (confirmed ~1-2% second-command-wins bias plus ~2% real overhead in the since-reverted per-entry hash interleave) and should not land on main. Put the workflow back to `HEAD main` so historical bench comparisons across PRs stay apples-to-apples.
Reverts 0f46687. The prewarm phase was added on this PR to test whether cold-cache ordering was driving the "main is a bit faster in every PR" pattern. It wasn't: the pre-warmed run showed the same 1.04× ratio as without. The ordering bias we eventually pinned down (~1-2% advantage for whichever command hyperfine runs second) lives somewhere in hyperfine's own plumbing, not in the shared state the prewarm covered. Since the prewarm didn't solve what it was added for and this PR's scope is a tarball-side perf fix, drop it. A separate PR can tackle the framework bias properly (randomizing order, averaging over multiple invocations, or switching to a tool that interleaves) without piggybacking on an unrelated review.
The previous commit on this branch picked `swc_malloc`, a meta-crate that selects mimalloc on macOS / Windows and jemalloc on Linux at compile time. The first CI integrated-benchmark run (pnpm#188 at 395a9d2) came back with `pacquet@HEAD` 2% slower than `pacquet@main` — a 57 ms gap inside noise bands (ratio 1.02 ± 0.04), which is consistent with the 1-2% "second-command wins" ordering bias we measured separately on pnpm#278. That reading tells us jemalloc on this Linux runner (glibc 2.35+) gives us nothing measurable over the system allocator. Before closing pnpm#188, run one more comparison with mimalloc instead of jemalloc — mimalloc has a different allocator philosophy (per-thread free lists + small-object fast path) and is a stronger fit for the "thousands of short-lived small allocations" shape a package manager produces. Diff is minimal: * `swc_malloc` → `mimalloc` in workspace + `pacquet-cli` deps. * `extern crate swc_malloc;` → explicit `#[global_allocator] static GLOBAL: mimalloc::MiMalloc = mimalloc::MiMalloc;` in `crates/cli/src/lib.rs`. * Comment rewritten to describe what this allocator actually is rather than the per-target selection swc_malloc was doing. If the next CI bench shows mimalloc also within noise of the system allocator, this PR closes — the allocator swap doesn't deliver on the Linux runner we measure against. If mimalloc wins, we land the direct-mimalloc form. Co-authored-by: 강동윤 (Donny) <kdy1997.dev@gmail.com>
It doesn't seem like any of these changes improve performance. I am closing it.
Summary
After a long iteration cycle (see commit history), this PR lands a narrow set of changes to the download path. The more ambitious fixes from investigations/pacquet-macos-perf.md were tried and reverted after CI benchmarks showed them regressing wall-time on the frozen-lockfile scenario.

What's in the PR

- Compressed-side tee hash: each downloaded chunk is fed through `ssri::IntegrityChecker` as it arrives. Replaces the pre-PR pattern of `response.bytes().await` + `package_integrity.check(&response)`: one linear pass over the compressed bytes for the integrity verdict instead of two (buffer once, hash the buffer).
- Integrity verified before decompression, matching pnpm's `worker/src/start.ts` ordering (`crypto.hash` → `gunzipSync`), so a corrupt-server-response error is reported as `TarballError::Checksum` before `TarballError::DecodeGzip` gets a chance to mask it.
- AGENTS.md clarification: observable behavior (down to `index.db` rows) must match pnpm exactly, while internal implementation details may diverge from upstream when the divergence delivers a measured perf win and leaves observable outputs byte-identical. Requires a `Why we diverge:` note on any such divergence so it stays auditable.

Net size: the tarball changes are a small refactor around the existing `run_without_mem_cache` flow plus a ~30-line comment documenting why we don't take the fuller streaming path (see below).

What was tried and reverted

- Streaming the body through `async_compression::tokio::write::GzipDecoder` to overlap network with decompression. Forced the gzip backend to `flate2`/`miniz_oxide` (the only gzip feature that crate ships), which is noticeably slower than `zune-inflate` on Apple Silicon / commodity x86. Toxiproxy-injected latency runs showed the PR 15% slower at 0 ms loopback, 13% slower at 50 ms, break-even at 200 ms. The overlap never overtook the decompressor regression.
- Per-entry hash interleave (`read_to_end` + `Sha512::digest` → chunked read+hash + `try_reserve` + `extend_from_slice`). Added ~2-3% overhead at 0 ms loopback in CI bench runs. The per-chunk branches cost more than the saved second pass at typical npm-entry sizes (small KiB bodies that fit comfortably in warm cache).

A block comment at the streaming site in `crates/tarball/src/lib.rs` records these numbers and the reasoning so the next person who reaches for `async-compression` doesn't have to re-derive the regression.

Upstream reference

`fetching/tarball-fetcher/src/remoteTarballFetcher.ts` + `worker/src/start.ts` + `store/cafs/src/addFilesFromTarball.ts` at pnpm/pnpm@1819226. Upstream buffers the body into a Node `SharedArrayBuffer` and runs `crypto.hash` + `zlib.gunzipSync` sequentially. This PR's compressed-side tee hash is an internal-detail divergence (bytes hashed during stream consumption rather than after buffering); observable output is identical.

Test plan

- `just ready`: 192/192 tests pass, clippy-clean, fmt-clean
- `packages_under_orgs_should_work`: real-network end-to-end download through the tee-hash pipeline, produces the same CAFS layout
- `should_throw_error_on_checksum_mismatch`: integrity mismatch still surfaces `TarballError::Checksum`