
perf(tarball): tee integrity hash on the download stream #278

Closed
zkochan wants to merge 12 commits into main from improve-perf2

Conversation

@zkochan
Member

@zkochan zkochan commented Apr 24, 2026

Summary

After a long iteration cycle (see commit history), this PR lands a narrow set of changes to the download path. The more ambitious fixes from investigations/pacquet-macos-perf.md were tried and reverted after CI benchmarks showed them regressing wall-time on the frozen-lockfile scenario.

What's in the PR

  1. Stream the compressed tarball body through ssri::IntegrityChecker as it arrives. Replaces the pre-PR pattern of response.bytes().await + package_integrity.check(&response) — one linear pass over the compressed bytes for the integrity verdict instead of two (buffer once, hash the buffer).
  2. Verify integrity before decompression rather than as part of the post-download closure. Matches pnpm's upstream worker/src/start.ts ordering (crypto.hash → gunzipSync), so a corrupt-server-response error is reported as TarballError::Checksum before TarballError::DecodeGzip gets a chance to mask it.
  3. AGENTS.md clause clarifying that the cardinal "match pnpm" rule governs observable behavior (flags, error codes, file formats, CAS layout, index.db rows) and that internal implementation details may diverge from upstream when the divergence delivers a measured perf win and leaves observable outputs byte-identical. Requires a Why we diverge: note on any such divergence so it stays auditable.
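The tee in change (1) can be sketched with std types only. This is a minimal illustration, not the PR's code: `DefaultHasher` stands in for the SHA-512 `ssri::IntegrityChecker`, and a plain iterator of byte chunks stands in for reqwest's `bytes_stream()`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Sketch of the tee: hash each network chunk as it arrives instead of
// buffering the whole body and hashing it in a second pass afterwards.
fn buffer_with_tee_hash(chunks: impl Iterator<Item = Vec<u8>>) -> (Vec<u8>, u64) {
    let mut body = Vec::new();
    let mut hasher = DefaultHasher::new();
    for chunk in chunks {
        hasher.write(&chunk); // tee: hash the chunk while it is hot in cache
        body.extend_from_slice(&chunk); // still buffered for sync decompression
    }
    // one linear pass produced both the buffered body and the integrity
    // input; the pre-PR code hashed the buffer again in a second pass here
    (body, hasher.finish())
}
```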

Net size: the tarball changes are a small refactor around the existing run_without_mem_cache flow plus a ~30-line comment documenting why we don't take the fuller streaming path (see below).

What was tried and reverted

  • Streaming the body through async_compression::tokio::write::GzipDecoder to overlap network with decompression. Forced the gzip backend to flate2/miniz_oxide (the only gzip feature that crate ships), which is noticeably slower than zune-inflate on Apple Silicon / commodity x86. Toxiproxy-injected latency runs showed the PR 15% slower at 0 ms loopback, 13% slower at 50 ms, break-even at 200 ms. The overlap never overtook the decompressor regression.
  • Interleaving SHA-512 with the per-entry tar read (read_to_end + Sha512::digest → chunked read+hash+try_reserve+extend_from_slice). Added ~2-3% overhead at 0 ms loopback in CI bench runs. The per-chunk branches cost more than the saved second pass at typical npm-entry sizes (small KiB bodies that fit comfortably in warm cache).
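For reference, the reverted per-entry interleave from the second bullet looked roughly like the loop below. Illustrative stand-ins only: `DefaultHasher` replaces SHA-512, `io::Error` replaces `TarballError`, and the 64 KiB chunk size mirrors the figure discussed in this thread.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::io::{Error, ErrorKind, Read, Result};

const CHUNK: usize = 64 * 1024;

// Read a tar entry in fixed chunks, hashing each chunk as it lands, with
// fallible allocation so an oversized entry cannot abort the process.
fn read_and_hash_entry(mut entry: impl Read) -> Result<(Vec<u8>, u64)> {
    let mut buffer: Vec<u8> = Vec::new();
    let mut hasher = DefaultHasher::new();
    let mut chunk = vec![0u8; CHUNK];
    loop {
        let n = entry.read(&mut chunk)?;
        if n == 0 {
            break;
        }
        hasher.write(&chunk[..n]); // hash while the chunk is still in cache
        buffer
            .try_reserve(n) // fallible growth instead of an aborting reserve
            .map_err(|e| Error::new(ErrorKind::OutOfMemory, e))?;
        buffer.extend_from_slice(&chunk[..n]);
    }
    Ok((buffer, hasher.finish())) // one pass instead of read_to_end + digest
}
```

The per-chunk branching here is exactly what cost the ~2-3% at small entry sizes.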

A block comment at the streaming site in crates/tarball/src/lib.rs records these numbers and the reasoning so the next person who reaches for async-compression doesn't have to re-derive the regression.

Upstream reference

fetching/tarball-fetcher/src/remoteTarballFetcher.ts + worker/src/start.ts + store/cafs/src/addFilesFromTarball.ts at pnpm/pnpm@1819226. Upstream buffers the body into a Node SharedArrayBuffer and runs crypto.hash + zlib.gunzipSync sequentially. This PR's compressed-side tee hash is an internal-detail divergence (bytes hashed during stream consumption rather than after buffering); observable output is identical.

Test plan

  • just ready — 192/192 tests pass, clippy-clean, fmt-clean
  • packages_under_orgs_should_work — real-network end-to-end download through the tee-hash pipeline, produces the same CAFS layout
  • should_throw_error_on_checksum_mismatch — integrity mismatch still surfaces TarballError::Checksum
  • Toxiproxy-injected latency benchmark (local M3, frozen-lockfile, 1352-snapshot fixture, 50 ms latency) — no regression vs main
  • CI integrated-benchmark at 0 ms loopback — within noise after fix Some feedback #2 revert

Contributor

Copilot AI left a comment


Pull request overview

This PR optimizes the tarball download/extract pipeline by streaming the HTTP response through gzip while concurrently hashing, and by hashing each tar entry as it’s read to eliminate redundant passes over bytes. It also updates agent guidelines and stabilizes a couple of XDG-related npmrc tests under developer environments.

Changes:

  • Stream tarball HTTP bytes through IntegrityChecker + async_compression gzip decoding to avoid buffering + double-hashing compressed bytes.
  • Interleave per-entry SHA-512 while reading tar entries and add StoreDir::write_cas_file_prehashed to avoid re-hashing full buffers.
  • Update docs (AGENTS.md) and adjust npmrc tests to avoid PNPM_HOME shadowing XDG behavior.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
crates/tarball/src/lib.rs Replace buffer-then-inflate with streaming gzip decode + integrity tee; hash tar entries while reading
crates/tarball/Cargo.toml Add deps needed for streaming decode and hashing (async-compression, futures-util, sha2)
crates/store-dir/src/cas_file.rs Add write_cas_file_prehashed and refactor write_cas_file to delegate
crates/npmrc/src/lib.rs Clear PNPM_HOME in XDG test to avoid ambient env shadowing
crates/npmrc/src/custom_deserializer.rs Clear PNPM_HOME in XDG test to avoid ambient env shadowing
Cargo.toml Add async-compression, enable reqwest stream feature, remove zune-inflate
Cargo.lock Lockfile updates for new/removed deps
AGENTS.md Document “internal performance divergence is allowed” guidance


Comment thread crates/store-dir/src/cas_file.rs Outdated
Comment thread crates/tarball/src/lib.rs Outdated
Comment thread crates/tarball/src/lib.rs Outdated
@codecov

codecov Bot commented Apr 24, 2026

Codecov Report

❌ Patch coverage is 76.47059% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.28%. Comparing base (af4b10d) to head (c4e1d0e).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
crates/tarball/src/lib.rs 76.47% 8 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #278      +/-   ##
==========================================
- Coverage   89.67%   89.28%   -0.40%     
==========================================
  Files          61       61              
  Lines        5074     5327     +253     
==========================================
+ Hits         4550     4756     +206     
- Misses        524      571      +47     

☔ View full report in Codecov by Sentry.

@github-actions

github-actions Bot commented Apr 24, 2026

Micro-Benchmark Results

Linux

group                          main                                   pr
-----                          ----                                   --
tarball/download_dependency    1.01     15.6±0.63ms   277.4 KB/sec    1.00     15.5±0.07ms   279.5 KB/sec

zkochan added a commit that referenced this pull request Apr 24, 2026
Three comments, all actionable:

* `write_cas_file_prehashed` silently corrupts the CAFS if a caller
  passes a hash that isn't `Sha512(buffer)` — the blob lands at a
  path whose filename advertises a digest its contents don't satisfy.
  Release builds trust the caller so the single-pass hot path stays
  fast; `debug_assert_eq!(Sha512::digest(buffer), file_hash)` catches
  the misuse in tests and debug builds.

* The chunked read loop inside `extract_tarball_entries` extended
  `buffer` via `extend_from_slice`, which falls back to `Vec::reserve`
  and aborts on allocation failure. The initial `try_reserve(
  prealloc_hint)` was the whole point of the OOM-safety guard — the
  hint is clamped to 64 MiB even when the tar header claims more, so
  an entry whose real size exceeds the clamp MUST keep allocating
  without aborting. Call `buffer.try_reserve(n)` before each extend
  and propagate failure as `TarballError::ReadTarballEntries`
  (matching the prealloc error shape).

* The streaming download bailed on the first `decoder.write_all`
  error, returning `DecodeGzip` for a corrupt-gzip-with-mismatched-
  integrity tarball. pnpm's upstream pipeline (`worker/src/start.ts`
  → `crypto.hash` then `gunzipSync`) hashes the compressed bytes
  before touching zlib, so the same input surfaces as an integrity
  failure there — and error codes are part of the public contract
  per AGENTS.md "errors and diagnostics". Keep feeding the
  `IntegrityChecker` with every chunk regardless of decoder state,
  then run `checker.result()` first. Only after integrity passes do
  we surface a stored decode error (or call `decoder.shutdown` to
  finalize a cut-short footer).
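The error-ordering fix in the last bullet reduces to a small decision table once both verdicts are in hand. Illustrative stand-ins only: the real code holds an `ssri::IntegrityChecker` result and a stored decoder error rather than these booleans, and the enum name here is hypothetical.

```rust
#[derive(Debug, PartialEq)]
enum FetchError {
    Checksum,
    DecodeGzip,
}

// Classify a finished download: the integrity verdict wins, so a corrupt
// server response surfaces as Checksum even when the gzip stream is also
// unparseable, matching pnpm's hash-before-gunzip ordering.
fn classify(integrity_ok: bool, decoder_failed: bool) -> Result<(), FetchError> {
    if !integrity_ok {
        return Err(FetchError::Checksum);
    }
    if decoder_failed {
        return Err(FetchError::DecodeGzip);
    }
    Ok(())
}
```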
@zkochan zkochan requested a review from Copilot April 24, 2026 16:28
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.



Comment thread crates/tarball/src/lib.rs Outdated
zkochan added a commit that referenced this pull request Apr 24, 2026
Copilot follow-up review on #278. `package_unpacked_size` comes from
registry metadata — untrusted — and flowed straight into
`Vec::with_capacity(size)`. A malicious or corrupt `unpackedSize` of
`usize::MAX` would turn the prealloc into an immediate abort before
the first byte of the tarball even arrived, no I/O error surfaceable.

Treat the hint as best-effort: clamp it to the same 64 MiB ceiling
used for `MAX_ENTRY_PREALLOC_BYTES` and reserve fallibly with
`try_reserve`, dropping the reservation on allocation failure so the
`Vec` simply falls back to geometric growth. Tarballs whose real
decompressed size exceeds the clamp still work, they just don't get
the prealloc benefit.

Note: real-world decompressed tarballs are under 64 MiB for > 99%
of npm packages (typescript at ~60 MiB sits near the top). The per-
tarball peak memory under concurrency = post-download-semaphore cap
× 64 MiB already bounds our in-flight RSS; this cap keeps that bound
from inflating on a bad manifest.
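The clamp-then-reserve-fallibly pattern from this commit can be sketched as follows; the ceiling echoes the 64 MiB `MAX_ENTRY_PREALLOC_BYTES` named above, while the constant and helper names are illustrative.

```rust
const PREALLOC_CEILING: usize = 64 * 1024 * 1024; // 64 MiB

// Treat an untrusted registry size hint as best-effort: clamp it, then
// reserve fallibly so a hostile hint can never abort the process. On
// allocation failure the Vec simply falls back to geometric growth.
fn prealloc_from_untrusted_hint(hint: Option<usize>) -> Vec<u8> {
    let mut buffer = Vec::new();
    if let Some(size) = hint {
        let _ = buffer.try_reserve(size.min(PREALLOC_CEILING));
    }
    buffer
}
```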
@zkochan zkochan requested a review from Copilot April 24, 2026 16:38
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 6 changed files in this pull request and generated 4 comments.



Comment thread crates/tarball/src/lib.rs Outdated
Comment thread crates/store-dir/src/cas_file.rs Outdated
Comment thread crates/store-dir/src/cas_file.rs Outdated
Comment thread crates/tarball/src/lib.rs Outdated
@github-actions

github-actions Bot commented Apr 24, 2026

Integrated-Benchmark Report (Linux)

Scenario: Frozen Lockfile

Command       Mean [s]       Min [s]  Max [s]  Relative
pacquet@HEAD  3.061 ± 0.073  2.961    3.189    1.03 ± 0.04
pacquet@main  2.980 ± 0.088  2.830    3.126    1.00
pnpm          6.708 ± 0.075  6.610    6.811    2.25 ± 0.07
BENCHMARK_REPORT.json
{
  "results": [
    {
      "command": "pacquet@HEAD",
      "mean": 3.0612074172,
      "stddev": 0.07254709458376475,
      "median": 3.0468951139,
      "user": 2.46729384,
      "system": 3.4732257399999993,
      "min": 2.9610352284,
      "max": 3.1894596804,
      "times": [
        3.1014075294,
        2.9840049164,
        3.1237359844,
        3.1894596804,
        3.0590186764,
        3.0347715514,
        3.0074028484,
        3.1275060804,
        3.0237316764,
        2.9610352284
      ],
      "exit_codes": [
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0
      ]
    },
    {
      "command": "pacquet@main",
      "mean": 2.9799968133,
      "stddev": 0.08831080976453795,
      "median": 2.9780374749,
      "user": 2.47132464,
      "system": 3.43976314,
      "min": 2.8299090384,
      "max": 3.1256024644,
      "times": [
        3.0483820684,
        2.9789185614,
        2.9307676064,
        2.9187984764,
        2.9771563884,
        3.0882474604,
        3.1256024644,
        2.9843966514,
        2.8299090384,
        2.9177894174
      ],
      "exit_codes": [
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0
      ]
    },
    {
      "command": "pnpm",
      "mean": 6.7077239796,
      "stddev": 0.07541291153144425,
      "median": 6.6887728799,
      "user": 9.28145574,
      "system": 4.581847639999999,
      "min": 6.6099420884,
      "max": 6.8111099794,
      "times": [
        6.6818056094000005,
        6.664050806400001,
        6.6519270294,
        6.695740150400001,
        6.809544804400001,
        6.7139850754,
        6.634682557400001,
        6.8111099794,
        6.6099420884,
        6.8044516954
      ],
      "exit_codes": [
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0
      ]
    }
  ]
}

zkochan added a commit that referenced this pull request Apr 24, 2026
Four comments, all actionable:

* `write_cas_file` now routes through a private
  `write_cas_file_unchecked` instead of `write_cas_file_prehashed`.
  The only change for the pnpm-facing API is internal: before this
  commit, every `write_cas_file` call on a debug/test build would
  re-hash its own buffer inside the `debug_assert_eq!` of
  `write_cas_file_prehashed`. That's cheap per-call but on an install
  hot path across a whole dep tree it showed up as measurable debug-
  build overhead. The unchecked helper skips the assert; the public
  `write_cas_file_prehashed` keeps it for the external-caller-with-
  its-own-hash case.

* Add two unit tests for `write_cas_file_prehashed`:
  - `write_cas_file_prehashed_parity_with_write_cas_file`: same bytes
    through both entry points land at the same CAFS path and produce
    the same on-disk contents, for both `executable=false` and
    `executable=true`.
  - `write_cas_file_prehashed_debug_asserts_hash_matches_buffer`:
    `#[should_panic]` locks in the foot-gun guard; gated on
    `debug_assertions` because the assert is compiled out in release.

* Replace the stale TODO above `package_integrity.to_string()`.
  The old comment talked about "Cloning here is less than desirable"
  and suggested `Arc::clone`, but after the streaming refactor this
  site no longer clones the `Integrity`; it serializes it to a
  `String` for the `index.db` key. The new comment reflects what the
  allocation is actually for and what would cut it (an
  `Arc<str>` field on `DownloadTarballToStore`) if a profile ever
  demanded it.

* Make tokio's `io-util` feature explicit at the workspace level.
  `AsyncWriteExt` / `AsyncReadExt` are feature-gated; we were
  getting them via whatever transitive crate happened to enable
  them (`reqwest`'s `stream` feature, most likely). That's fragile
  across dep-graph changes. Now declared as a first-class feature
  on the workspace tokio line alongside `rt`, `rt-multi-thread`,
  and `macros`.
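The explicit feature declaration from the last bullet would look roughly like this in the workspace Cargo.toml (the version requirement shown is assumed, not taken from the PR):

```toml
# workspace Cargo.toml (illustrative): io-util declared first-class so
# AsyncWriteExt/AsyncReadExt no longer depend on transitive enablement
[workspace.dependencies]
tokio = { version = "1", features = ["rt", "rt-multi-thread", "macros", "io-util"] }
```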
@zkochan zkochan requested a review from Copilot April 24, 2026 18:37
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.



Comment thread crates/tarball/src/lib.rs Outdated
Comment on lines +541 to +549
// the compressed bytes into an SHA-512 `IntegrityChecker`.
// Compared to the old "buffer-everything, hash, decompress"
// pipeline this removes one full pass over the compressed body
// (no separate `Integrity::check(&response)` call), never holds
// both the compressed and decompressed buffers at the same
// time, and lets decompression overlap with `await`s on the
// network. The blocking tar iterator still runs inside
// `spawn_blocking` below so per-file SHA-512 + CAFS writes
// stay off the reactor.

Copilot AI Apr 24, 2026


Grammar/accuracy nit in the comment: “teeing the compressed bytes into an SHA-512 IntegrityChecker” should be “a …”, and IntegrityChecker isn’t inherently SHA-512 (it follows the algorithms present in Integrity). Consider rewording to avoid implying a fixed hash algorithm.

Suggested change
// the compressed bytes into an SHA-512 `IntegrityChecker`.
// Compared to the old "buffer-everything, hash, decompress"
// pipeline this removes one full pass over the compressed body
// (no separate `Integrity::check(&response)` call), never holds
// both the compressed and decompressed buffers at the same
// time, and lets decompression overlap with `await`s on the
// network. The blocking tar iterator still runs inside
// `spawn_blocking` below so per-file SHA-512 + CAFS writes
// stay off the reactor.
// the compressed bytes into an `IntegrityChecker` for the
// expected integrity algorithms. Compared to the old
// "buffer-everything, hash, decompress" pipeline this removes
// one full pass over the compressed body (no separate
// `Integrity::check(&response)` call), never holds both the
// compressed and decompressed buffers at the same time, and
// lets decompression overlap with `await`s on the network. The
// blocking tar iterator still runs inside `spawn_blocking`
// below so per-file SHA-512 + CAFS writes stay off the reactor.

Copilot uses AI. Check for mistakes.
Comment thread crates/store-dir/src/cas_file.rs Outdated
Comment on lines +74 to +97
/// Same as [`write_cas_file`](Self::write_cas_file) but takes a
/// pre-computed SHA-512 so the caller can hash the bytes as they
/// arrive (e.g. chunk-by-chunk out of a tar entry) and skip the
/// separate `Sha512::digest(buffer)` pass inside this function.
///
/// The caller is responsible for ensuring `file_hash` is the
/// SHA-512 of `buffer`; passing a mismatched hash would corrupt
/// the CAFS (the blob would land at a path that advertises a
/// digest its contents don't satisfy). A `debug_assert` catches
/// that misuse in debug builds / tests; release builds trust the
/// caller so the write path stays single-pass.
pub fn write_cas_file_prehashed(
    &self,
    buffer: &[u8],
    file_hash: FileHash,
    executable: bool,
) -> Result<PathBuf, WriteCasFileError> {
    debug_assert_eq!(
        Sha512::digest(buffer),
        file_hash,
        "write_cas_file_prehashed called with a hash that is not Sha512(buffer); \
         passing a mismatched hash would silently corrupt the CAFS",
    );
    self.write_cas_file_unchecked(buffer, file_hash, executable)

Copilot AI Apr 24, 2026


write_cas_file_prehashed is safe to call but relies on the caller to provide the correct hash (the guard is only a debug_assert, so release builds will trust the input). Since a mismatch would place bytes under the wrong digest path, consider making the “trusted/unchecked” nature more explicit in the API (e.g., naming) to reduce accidental misuse by other internal callers.

@KSXGitHub
Contributor

KSXGitHub commented Apr 24, 2026

Integrated-Benchmark Report (Linux)

Scenario: Frozen Lockfile

Command       Mean [s]       Min [s]  Max [s]  Relative
pacquet@HEAD  2.983 ± 0.083  2.834    3.144    1.15 ± 0.04
pacquet@main  2.603 ± 0.060  2.535    2.706    1.00
pnpm          5.900 ± 0.068  5.738    6.003    2.27 ± 0.06

HEAD (this PR) is 1.15 times slower according to the latest report, I wonder what happened?

@zkochan
Member Author

zkochan commented Apr 24, 2026

Update: the streaming pipeline has been reverted in bab8b40. The PR now keeps only the two changes that unambiguously help.

What's left in the PR

  1. Per-entry hash interleave: StoreDir::write_cas_file_prehashed, plus the chunked read+hash loop in extract_tarball_entries. One pass over decompressed entry bytes instead of two.
  2. Compressed-side tee hash — still stream the response via bytes_stream() and feed each chunk into ssri::IntegrityChecker so the per-tarball SHA-512 is computed during the download instead of in a separate Integrity::check(&buffer) pass afterward.

Compressed bytes are still buffered fully before decompression; zune-inflate still runs sync inside spawn_blocking, exactly like pre-PR. Shape now matches pnpm's crypto.hash → gunzipSync ordering.
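That post-revert shape reduces to a three-step pipeline: buffer while tee-hashing, verify, then decompress. A minimal sketch, with `verify` and `inflate` as illustrative stand-ins for ssri verification and zune-inflate:

```rust
// Buffer the compressed body, verify integrity, then decompress; the
// integrity verdict is produced before gzip is ever touched.
fn fetch_shape(
    chunks: impl Iterator<Item = Vec<u8>>,
    verify: impl Fn(&[u8]) -> bool,
    inflate: impl Fn(&[u8]) -> Option<Vec<u8>>,
) -> Result<Vec<u8>, &'static str> {
    let mut compressed = Vec::new();
    for chunk in chunks {
        // the real loop also feeds each chunk to ssri::IntegrityChecker here
        compressed.extend_from_slice(&chunk);
    }
    if !verify(&compressed) {
        return Err("Checksum"); // integrity first, matching pnpm's ordering
    }
    // in the real code this step runs inside spawn_blocking
    inflate(&compressed).ok_or("DecodeGzip")
}
```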

Why I dropped the async gzip pipeline

Toxiproxy-injected latency runs (integrated-benchmark frozen-lockfile, 1352-snapshot fixture, local M3):

latency          main                async-compression (previous PR head)  narrow-revert (this PR head)
0 ms (loopback)  2.60 ± 0.06 s (CI)  2.98 ± 0.08 s (CI)                    expect parity — CI will re-run
50 ms            16.70 ± 0.78 s      18.94 ± 1.06 s                        16.54 ± 1.25 s
200 ms           22.37 ± 0.42 s      22.58 ± 0.55 s                        not measured

The async-compression version forced the gzip backend to flate2/miniz_oxide (the only gzip feature that crate ships), which is noticeably slower on decompress than zune-inflate on Apple Silicon / commodity x86 (cf. investigations/pacquet-macos-perf.md §7). At every latency tier I could reproduce, the decompressor regression dominated the network/decompress overlap gain. The gap shrank with rising latency but never flipped to a win — the break-even point is somewhere around 200 ms, past the typical CDN / same-continent range.

The narrow revert recovers parity with main on the 50 ms run (within noise) while keeping the two cheap wins. A block comment at the streaming site documents these numbers so the next person who reaches for async-compression doesn't have to re-derive the regression.

Follow-up

A separate small PR will teach integrated-benchmark a --latency-ms flag (auto-launch toxiproxy + attach a latency toxic) so every future streaming/fetching change gets realistic numbers in CI, not just 0 ms loopback.

zkochan added 7 commits April 24, 2026 23:03
Clarify the scope of the cardinal "match pnpm exactly" rule: it
governs observable behavior (CLI flags, defaults, error codes,
`.npmrc` semantics, lockfile format, CAS/store-index layout) and
not internal implementation detail. Pipeline topology, async vs.
sync, streaming vs. buffering, hash-tee vs. post-hoc hashing,
thread-pool shape, chunk sizes, and parallelism strategy may
diverge when the divergence delivers a measured perf win and
leaves observable outputs byte-identical.

Require any such divergence to carry a `Why we diverge:` note so
it stays auditable and reversible if upstream later adopts a
different approach.
Two perf fixes from investigations/pacquet-macos-perf.md that both
collapse redundant passes over the tarball bytes.

1. Stream the HTTP body through the gzip decoder (doc fix #4).
   Replace `response.bytes().await` + `decompress_gzip` (sync
   zune-inflate) + separate `Integrity::check(&buffer)` with a
   streaming pipeline: `bytes_stream()` teed through
   `ssri::IntegrityChecker` into `async_compression::tokio::write::
   GzipDecoder`. This removes one full pass over the compressed body
   (no separate `.check()`), never holds compressed + decompressed
   buffers at the same time, and lets decompression overlap with
   network awaits. Integrity is still verified end-to-end before any
   CAFS blob is written, so a mismatch aborts the install with the
   same `TarballError::Checksum` behavior as before.

2. Interleave SHA-512 with the per-entry read (doc fix #2). Add
   `StoreDir::write_cas_file_prehashed`; the tar extraction loop
   reads + hashes 64 KiB chunks in one pass instead of
   `read_to_end` followed by a separate `Sha512::digest(buffer)`
   inside `write_cas_file`. The buffer still has to exist (CAFS
   filename is hash-derived), but two passes over the decompressed
   entry bytes collapse into one. Mirrors pnpm's `parseTarball` +
   `addBufferToCafs` in `store/cafs/src/addFilesFromTarball.ts`.

Why we diverge from pnpm on (1): upstream
`fetching/tarball-fetcher/src/remoteTarballFetcher.ts` +
`worker/src/start.ts` + `store/cafs/src/addFilesFromTarball.ts`
buffers the whole tarball into a Node `SharedArrayBuffer` and runs
`crypto.hash` + `zlib.gunzipSync` sequentially. Observable behavior
here is identical — same integrity errors on mismatch, same
decompressed bytes, same CAFS layout, same `index.db` rows — so per
the "Internal performance divergence is allowed" clause added in the
preceding commit, the streaming topology is a pacquet-specific
implementation-detail optimization rather than a shape port.

Dep changes: swap `zune-inflate` for `async-compression` (gzip +
tokio features) + `futures-util`; enable reqwest's `stream` feature;
add `sha2` to the tarball crate.

Tests: existing `packages_under_orgs_should_work` (real-network
end-to-end) and `should_throw_error_on_checksum_mismatch` both pass
unchanged, proving the streaming pipeline produces the same
decompressed bytes and fails integrity checks the same way.
An earlier commit in this PR routed the response body through
`async_compression::tokio::write::GzipDecoder` so network fetch and
gzip inflation could overlap. That forced the gzip backend to
`flate2`/`miniz_oxide` (the only gzip feature async-compression
offers) and turned out to regress wall-time at every latency tier
I could reproduce locally.

Toxiproxy-injected latency, integrated-benchmark frozen-lockfile
scenario, 1352-snapshot fixture, M3 / APFS:

    0 ms (loopback)  main 2.60 s  vs  streaming 2.98 s  (15% slower)
    50 ms (typical)  main 16.70 s vs  streaming 18.94 s (13% slower)
    200 ms (WAN)     main 22.37 s vs  streaming 22.58 s (break-even)

The overlap gain does scale with latency, but it never overtakes
the decompressor regression at any realistic latency — `miniz_oxide`
is slow enough on the decompress side that matching pnpm upstream's
"buffer body, then `gunzipSync`" topology wins on wall-time with the
current decompressor set.

Revert to:

* `bytes_stream()` drained chunk-by-chunk so we can tee each chunk
  through the `IntegrityChecker` — this is the one piece of the
  streaming pipeline that pays for itself. Saves one full pass over
  the compressed body vs the pre-PR `response.bytes().await` +
  `package_integrity.check(&response)`.
* Buffer the compressed bytes fully, then decompress with
  `zune-inflate` sync inside `spawn_blocking`, exactly as the
  pre-PR code did.
* Integrity is verified before we touch gzip, matching pnpm's
  `crypto.hash` → `gunzipSync` ordering.

Keeps the per-entry hash interleave from the earlier commit (that
one is unambiguously a win and scales with file count, not latency).

Removes `async-compression` (workspace + crate dep) and tokio's
`io-util` feature (no other caller in the workspace); restores
`zune-inflate` as the decompressor.

Code change includes a block comment at the streaming site that
cites these numbers so the next person who reaches for
async-compression doesn't have to re-derive the regression.

Hyperfine runs its measured commands sequentially (no per-iteration
interleaving). With `--warmup 1`, the second-listed command's
measured runs always start from a state that the first command's 10
measured runs already warmed — OS page cache, verdaccio's on-disk
cache, registry-mock in-memory state, inode cache, everything.

This shows up as a systematic 1-3% advantage for whichever command
is listed second in every PR's benchmark run, regardless of the
code difference. #278's CI surfaced this clearly: three successive
rebases all reported main slightly ahead of the PR HEAD (1.03× →
1.15× range depending on how much real signal was in the change),
and it replicated even for the current commit where the diff is
tiny. The workflow passes `HEAD main`, so `main` is always the
second command, and `main` always appears slightly faster.

Fix it by running each benchmarked install once in a `prewarm`
phase before hyperfine starts. After prewarm, OS + verdaccio +
registry-mock are all fully primed, and hyperfine's `--warmup 1`
per command is just the final touch rather than the sole source of
state warmup. Both commands' measured runs then start from the
same warm state, and any remaining difference reflects the code
under test.

Cost: one extra install per revision (plus pnpm if
`--with-pnpm`). At ~2.6s each on CI, that's ≤10s of added bench
time for a three-way compare — cheap for reliable numbers.

Refactors `cleanup_command()` out of `benchmark()` so `prewarm()`
and `benchmark()` share the exact same rm-rf between iterations,
keeping the state machine symmetric.
zkochan added 4 commits April 24, 2026 23:58
Temporary diagnostic for #278. The last two CI bench runs showed
`main` ~3-4% ahead of `HEAD`, but the prewarm pass from the
previous commit should have eliminated cold-cache ordering bias —
yet the ratio was essentially unchanged. That leaves two
competing hypotheses:

* systematic benchmark-framework bias favouring the second-listed
  hyperfine command (CPU scheduling, measurement overhead, runner
  variance), or
* a real ~3-4% regression in the remaining PR changes.

Flip the argument order from `HEAD main` to `main HEAD` so `main`
is now the first-listed command. If the next CI run shows `HEAD`
winning by a similar margin, it's framework bias and we'll follow
up with a proper randomization. If `main` still wins, the PR has
real overhead and we revert the per-entry hash interleave.

Revert this diff once the signal is unambiguous.

Fix #2 from `investigations/pacquet-macos-perf.md` turned out to
regress wall-time at 0ms loopback by ~2% across three successive
CI runs on #278, even after controlling for hyperfine-ordering
bias (the previous commit swapped `HEAD main` → `main HEAD`; main
stayed ahead by 2%, so the bias hypothesis explains only ~half the
gap). Fix #2 is the most likely culprit — it hits every tar entry
(tens of thousands across the 1352-snapshot fixture) and replaces
an optimized `Vec::read_to_end` + single `Sha512::digest` with a
chunked read loop that does per-chunk `try_reserve`,
`hasher.update`, and `extend_from_slice`. The streaming
per-chunk branches add overhead that the saved second pass doesn't
recover at the scale of small npm entries (typical entry body
is a few KiB — way below where a second scan of the already-hot
cache starts mattering).

Revert to the pre-PR shape on the entry loop:

    let mut buffer = Vec::new();
    buffer.try_reserve(prealloc_hint).map_err(...)?;
    entry.read_to_end(&mut buffer).map_err(...)?;
    let (file_path, file_hash) = store_dir
        .write_cas_file(&buffer, file_is_executable)
        .map_err(...)?;

Remove `write_cas_file_prehashed` and the private
`write_cas_file_unchecked` helper from `StoreDir` — nothing calls
them now. Also drop the two prehashed-specific tests
(`…_parity_with_write_cas_file` and
`…_debug_asserts_hash_matches_buffer`) and the tarball crate's
`sha2` dep, since the one-shot `write_cas_file` on
`pacquet-store-dir` is the only consumer of SHA-512 once more.

What's still left in the PR after this revert:

* compressed-side tee-hash: `bytes_stream()` + per-chunk
  `IntegrityChecker::input` eliminates the separate
  `Integrity::check(&buffer)` pass after buffering. Chunk count
  here is tiny (a handful per tarball) so per-chunk overhead is
  negligible, and the payoff (skipping one full SHA-512 over the
  compressed body) scales with tarball size.
* integrity check ordering before decompression, matching pnpm's
  `crypto.hash` → `gunzipSync` sequence.
* AGENTS.md's "internal perf divergence is allowed" clause.

Next CI bench will tell us whether the remaining tee-hash alone
is clean or whether the regression persists and the PR should
close.

Reverts 2c1e83b. That commit flipped the argument order as a
one-off probe to disambiguate hyperfine ordering bias from real
regression on #278 — it served its diagnostic purpose (confirmed
~1-2% second-command-wins bias plus ~2% real overhead in the
since-reverted per-entry hash interleave) and should not land on
main. Put the workflow back to `HEAD main` so historical bench
comparisons across PRs stay apples-to-apples.

Reverts 0f46687. The prewarm phase was added on this PR to test
whether cold-cache ordering was driving the "main is a bit faster
in every PR" pattern. It wasn't — the pre-warmed run showed the
same 1.04× ratio as without. The ordering bias we eventually
pinned down (~1-2% advantage for whichever command hyperfine runs
second) lives somewhere in hyperfine's own plumbing, not in the
shared state the prewarm covered.

Since the prewarm didn't solve what it was added for and this
PR's scope is a tarball-side perf fix, drop it. A separate PR can
tackle the framework bias properly (randomizing order, averaging
over multiple invocations, or switching to a tool that
interleaves) without piggybacking on an unrelated review.
@zkochan zkochan changed the title perf(tarball): stream HTTP body into gzip; interleave per-entry hash perf(tarball): tee integrity hash on the download stream Apr 24, 2026
zkochan added a commit to kdy1/pacquet that referenced this pull request Apr 24, 2026
The previous commit on this branch picked `swc_malloc`, a
meta-crate that selects mimalloc on macOS / Windows and jemalloc
on Linux at compile time. The first CI integrated-benchmark run
(pnpm#188 at 395a9d2) came back with `pacquet@HEAD` 2% slower than
`pacquet@main` — a 57 ms gap inside noise bands (ratio 1.02 ±
0.04), which is consistent with the 1-2% "second-command wins"
ordering bias we measured separately on pnpm#278.

That reading tells us jemalloc on this Linux runner (glibc 2.35+)
gives us nothing measurable over the system allocator. Before
closing pnpm#188, run one more comparison with mimalloc instead of
jemalloc — mimalloc has a different allocator philosophy
(per-thread free lists + small-object fast path) and is a
stronger fit for the "thousands of short-lived small
allocations" shape a package manager produces.

Diff is minimal:
  * `swc_malloc` → `mimalloc` in workspace + `pacquet-cli` deps.
  * `extern crate swc_malloc;` → explicit
    `#[global_allocator] static GLOBAL: mimalloc::MiMalloc =
    mimalloc::MiMalloc;` in `crates/cli/src/lib.rs`.
  * Comment rewritten to describe what this allocator actually is
    rather than the per-target selection swc_malloc was doing.

If the next CI bench shows mimalloc also within noise of the
system allocator, this PR closes — the allocator swap doesn't
deliver on the Linux runner we measure against. If mimalloc
wins, we land the direct-mimalloc form.

Co-authored-by: 강동윤 (Donny) <kdy1997.dev@gmail.com>
@zkochan zkochan mentioned this pull request Apr 24, 2026
@zkochan
Member Author

zkochan commented Apr 24, 2026

It doesn't seem like any of these changes improve performance. I am closing it.

@zkochan zkochan closed this Apr 24, 2026