
perf(tarball): tee integrity hash on the download stream #278

Closed
zkochan wants to merge 12 commits into main from improve-perf2

Conversation

@zkochan
Member

@zkochan zkochan commented Apr 24, 2026

Summary

After a long iteration cycle (see commit history), this PR lands a narrow set of changes to the download path. The more ambitious fixes from investigations/pacquet-macos-perf.md were tried and reverted after CI benchmarks showed them regressing wall-time on the frozen-lockfile scenario.

What's in the PR

  1. Stream the compressed tarball body through ssri::IntegrityChecker as it arrives. Replaces the pre-PR pattern of response.bytes().await + package_integrity.check(&response) — one linear pass over the compressed bytes for the integrity verdict instead of two (buffer once, hash the buffer).
  2. Verify integrity before decompression rather than as part of the post-download closure. Matches pnpm's upstream worker/src/start.ts ordering (crypto.hash → gunzipSync), so a corrupt-server-response error is reported as TarballError::Checksum before TarballError::DecodeGzip gets a chance to mask it.
  3. AGENTS.md clause clarifying that the cardinal "match pnpm" rule governs observable behavior (flags, error codes, file formats, CAS layout, index.db rows) and that internal implementation details may diverge from upstream when the divergence delivers a measured perf win and leaves observable outputs byte-identical. Requires a Why we diverge: note on any such divergence so it stays auditable.
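The tee in change (1) can be sketched with std types only. This is a minimal illustration, not the PR's code: `DefaultHasher` stands in for the SHA-512 `ssri::IntegrityChecker`, and a plain iterator of byte chunks stands in for reqwest's `bytes_stream()`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Sketch of the tee: hash each network chunk as it arrives instead of
// buffering the whole body and hashing it in a second pass afterwards.
fn buffer_with_tee_hash(chunks: impl Iterator<Item = Vec<u8>>) -> (Vec<u8>, u64) {
    let mut body = Vec::new();
    let mut hasher = DefaultHasher::new();
    for chunk in chunks {
        hasher.write(&chunk); // tee: hash the chunk while it is hot in cache
        body.extend_from_slice(&chunk); // still buffered for sync decompression
    }
    // one linear pass produced both the buffered body and the integrity
    // input; the pre-PR code hashed the buffer again in a second pass here
    (body, hasher.finish())
}
```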

Net size: the tarball changes are a small refactor around the existing run_without_mem_cache flow plus a ~30-line comment documenting why we don't take the fuller streaming path (see below).

What was tried and reverted

  • Streaming the body through async_compression::tokio::write::GzipDecoder to overlap network with decompression. Forced the gzip backend to flate2/miniz_oxide (the only gzip feature that crate ships), which is noticeably slower than zune-inflate on Apple Silicon / commodity x86. Toxiproxy-injected latency runs showed the PR 15% slower at 0 ms loopback, 13% slower at 50 ms, break-even at 200 ms. The overlap never overtook the decompressor regression.
  • Interleaving SHA-512 with the per-entry tar read (read_to_end + Sha512::digest → chunked read+hash+try_reserve+extend_from_slice). Added ~2-3% overhead at 0 ms loopback in CI bench runs. The per-chunk branches cost more than the saved second pass at typical npm-entry sizes (small KiB bodies that fit comfortably in warm cache).
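For reference, the reverted per-entry interleave from the second bullet looked roughly like the loop below. Illustrative stand-ins only: `DefaultHasher` replaces SHA-512, `io::Error` replaces `TarballError`, and the 64 KiB chunk size mirrors the figure discussed in this thread.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::io::{Error, ErrorKind, Read, Result};

const CHUNK: usize = 64 * 1024;

// Read a tar entry in fixed chunks, hashing each chunk as it lands, with
// fallible allocation so an oversized entry cannot abort the process.
fn read_and_hash_entry(mut entry: impl Read) -> Result<(Vec<u8>, u64)> {
    let mut buffer: Vec<u8> = Vec::new();
    let mut hasher = DefaultHasher::new();
    let mut chunk = vec![0u8; CHUNK];
    loop {
        let n = entry.read(&mut chunk)?;
        if n == 0 {
            break;
        }
        hasher.write(&chunk[..n]); // hash while the chunk is still in cache
        buffer
            .try_reserve(n) // fallible growth instead of an aborting reserve
            .map_err(|e| Error::new(ErrorKind::OutOfMemory, e))?;
        buffer.extend_from_slice(&chunk[..n]);
    }
    Ok((buffer, hasher.finish())) // one pass instead of read_to_end + digest
}
```

The per-chunk branching here is exactly what cost the ~2-3% at small entry sizes.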

A block comment at the streaming site in crates/tarball/src/lib.rs records these numbers and the reasoning so the next person who reaches for async-compression doesn't have to re-derive the regression.

Upstream reference

fetching/tarball-fetcher/src/remoteTarballFetcher.ts + worker/src/start.ts + store/cafs/src/addFilesFromTarball.ts at pnpm/pnpm@1819226. Upstream buffers the body into a Node SharedArrayBuffer and runs crypto.hash + zlib.gunzipSync sequentially. This PR's compressed-side tee hash is an internal-detail divergence (bytes hashed during stream consumption rather than after buffering); observable output is identical.

Test plan

  • just ready — 192/192 tests pass, clippy-clean, fmt-clean
  • packages_under_orgs_should_work — real-network end-to-end download through the tee-hash pipeline, produces the same CAFS layout
  • should_throw_error_on_checksum_mismatch — integrity mismatch still surfaces TarballError::Checksum
  • Toxiproxy-injected latency benchmark (local M3, frozen-lockfile, 1352-snapshot fixture, 50 ms latency) — no regression vs main
  • CI integrated-benchmark at 0 ms loopback — within noise after fix Some feedback #2 revert

Contributor

Copilot AI left a comment


Pull request overview

This PR optimizes the tarball download/extract pipeline by streaming the HTTP response through gzip while concurrently hashing, and by hashing each tar entry as it’s read to eliminate redundant passes over bytes. It also updates agent guidelines and stabilizes a couple of XDG-related npmrc tests under developer environments.

Changes:

  • Stream tarball HTTP bytes through IntegrityChecker + async_compression gzip decoding to avoid buffering + double-hashing compressed bytes.
  • Interleave per-entry SHA-512 while reading tar entries and add StoreDir::write_cas_file_prehashed to avoid re-hashing full buffers.
  • Update docs (AGENTS.md) and adjust npmrc tests to avoid PNPM_HOME shadowing XDG behavior.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
crates/tarball/src/lib.rs Replace buffer-then-inflate with streaming gzip decode + integrity tee; hash tar entries while reading
crates/tarball/Cargo.toml Add deps needed for streaming decode and hashing (async-compression, futures-util, sha2)
crates/store-dir/src/cas_file.rs Add write_cas_file_prehashed and refactor write_cas_file to delegate
crates/npmrc/src/lib.rs Clear PNPM_HOME in XDG test to avoid ambient env shadowing
crates/npmrc/src/custom_deserializer.rs Clear PNPM_HOME in XDG test to avoid ambient env shadowing
Cargo.toml Add async-compression, enable reqwest stream feature, remove zune-inflate
Cargo.lock Lockfile updates for new/removed deps
AGENTS.md Document “internal performance divergence is allowed” guidance


Comment thread crates/store-dir/src/cas_file.rs Outdated
Comment thread crates/tarball/src/lib.rs Outdated
Comment thread crates/tarball/src/lib.rs Outdated
@codecov

codecov Bot commented Apr 24, 2026

Codecov Report

❌ Patch coverage is 76.47059% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.28%. Comparing base (af4b10d) to head (c4e1d0e).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
crates/tarball/src/lib.rs 76.47% 8 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #278      +/-   ##
==========================================
- Coverage   89.67%   89.28%   -0.40%     
==========================================
  Files          61       61              
  Lines        5074     5327     +253     
==========================================
+ Hits         4550     4756     +206     
- Misses        524      571      +47     

☔ View full report in Codecov by Sentry.

@github-actions

github-actions Bot commented Apr 24, 2026

Micro-Benchmark Results

Linux

group                          main                                   pr
-----                          ----                                   --
tarball/download_dependency    1.01     15.6±0.63ms   277.4 KB/sec    1.00     15.5±0.07ms   279.5 KB/sec

zkochan added a commit that referenced this pull request Apr 24, 2026
Three comments, all actionable:

* `write_cas_file_prehashed` silently corrupts the CAFS if a caller
  passes a hash that isn't `Sha512(buffer)` — the blob lands at a
  path whose filename advertises a digest its contents don't satisfy.
  Release builds trust the caller so the single-pass hot path stays
  fast; `debug_assert_eq!(Sha512::digest(buffer), file_hash)` catches
  the misuse in tests and debug builds.

* The chunked read loop inside `extract_tarball_entries` extended
  `buffer` via `extend_from_slice`, which falls back to `Vec::reserve`
  and aborts on allocation failure. The initial `try_reserve(
  prealloc_hint)` was the whole point of the OOM-safety guard — the
  hint is clamped to 64 MiB even when the tar header claims more, so
  an entry whose real size exceeds the clamp MUST keep allocating
  without aborting. Call `buffer.try_reserve(n)` before each extend
  and propagate failure as `TarballError::ReadTarballEntries`
  (matching the prealloc error shape).

* The streaming download bailed on the first `decoder.write_all`
  error, returning `DecodeGzip` for a corrupt-gzip-with-mismatched-
  integrity tarball. pnpm's upstream pipeline (`worker/src/start.ts`
  → `crypto.hash` then `gunzipSync`) hashes the compressed bytes
  before touching zlib, so the same input surfaces as an integrity
  failure there — and error codes are part of the public contract
  per AGENTS.md "errors and diagnostics". Keep feeding the
  `IntegrityChecker` with every chunk regardless of decoder state,
  then run `checker.result()` first. Only after integrity passes do
  we surface a stored decode error (or call `decoder.shutdown` to
  finalize a cut-short footer).
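The error-ordering fix in the last bullet reduces to a small decision table once both verdicts are in hand. Illustrative stand-ins only: the real code holds an `ssri::IntegrityChecker` result and a stored decoder error rather than these booleans, and the enum name here is hypothetical.

```rust
#[derive(Debug, PartialEq)]
enum FetchError {
    Checksum,
    DecodeGzip,
}

// Classify a finished download: the integrity verdict wins, so a corrupt
// server response surfaces as Checksum even when the gzip stream is also
// unparseable, matching pnpm's hash-before-gunzip ordering.
fn classify(integrity_ok: bool, decoder_failed: bool) -> Result<(), FetchError> {
    if !integrity_ok {
        return Err(FetchError::Checksum);
    }
    if decoder_failed {
        return Err(FetchError::DecodeGzip);
    }
    Ok(())
}
```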
@zkochan zkochan requested a review from Copilot April 24, 2026 16:28
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.



Comment thread crates/tarball/src/lib.rs Outdated
zkochan added a commit that referenced this pull request Apr 24, 2026
Copilot follow-up review on #278. `package_unpacked_size` comes from
registry metadata — untrusted — and flowed straight into
`Vec::with_capacity(size)`. A malicious or corrupt `unpackedSize` of
`usize::MAX` would turn the prealloc into an immediate abort before
the first byte of the tarball even arrived, no I/O error surfaceable.

Treat the hint as best-effort: clamp it to the same 64 MiB ceiling
used for `MAX_ENTRY_PREALLOC_BYTES` and reserve fallibly with
`try_reserve`, dropping the reservation on allocation failure so the
`Vec` simply falls back to geometric growth. Tarballs whose real
decompressed size exceeds the clamp still work, they just don't get
the prealloc benefit.

Note: real-world decompressed tarballs are under 64 MiB for > 99%
of npm packages (typescript at ~60 MiB sits near the top). The per-
tarball peak memory under concurrency = post-download-semaphore cap
× 64 MiB already bounds our in-flight RSS; this cap keeps that bound
from inflating on a bad manifest.
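The clamp-then-reserve-fallibly pattern from this commit can be sketched as follows; the ceiling echoes the 64 MiB `MAX_ENTRY_PREALLOC_BYTES` named above, while the constant and helper names are illustrative.

```rust
const PREALLOC_CEILING: usize = 64 * 1024 * 1024; // 64 MiB

// Treat an untrusted registry size hint as best-effort: clamp it, then
// reserve fallibly so a hostile hint can never abort the process. On
// allocation failure the Vec simply falls back to geometric growth.
fn prealloc_from_untrusted_hint(hint: Option<usize>) -> Vec<u8> {
    let mut buffer = Vec::new();
    if let Some(size) = hint {
        let _ = buffer.try_reserve(size.min(PREALLOC_CEILING));
    }
    buffer
}
```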
@zkochan zkochan requested a review from Copilot April 24, 2026 16:38
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 6 changed files in this pull request and generated 4 comments.



Comment thread crates/tarball/src/lib.rs Outdated
Comment thread crates/store-dir/src/cas_file.rs Outdated
Comment thread crates/store-dir/src/cas_file.rs Outdated
Comment thread crates/tarball/src/lib.rs Outdated
@github-actions

github-actions Bot commented Apr 24, 2026

Integrated-Benchmark Report (Linux)

Scenario: Frozen Lockfile

Command       Mean [s]       Min [s]  Max [s]  Relative
pacquet@HEAD  3.061 ± 0.073  2.961    3.189    1.03 ± 0.04
pacquet@main  2.980 ± 0.088  2.830    3.126    1.00
pnpm          6.708 ± 0.075  6.610    6.811    2.25 ± 0.07
BENCHMARK_REPORT.json
{
  "results": [
    {
      "command": "pacquet@HEAD",
      "mean": 3.0612074172,
      "stddev": 0.07254709458376475,
      "median": 3.0468951139,
      "user": 2.46729384,
      "system": 3.4732257399999993,
      "min": 2.9610352284,
      "max": 3.1894596804,
      "times": [
        3.1014075294,
        2.9840049164,
        3.1237359844,
        3.1894596804,
        3.0590186764,
        3.0347715514,
        3.0074028484,
        3.1275060804,
        3.0237316764,
        2.9610352284
      ],
      "exit_codes": [
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0
      ]
    },
    {
      "command": "pacquet@main",
      "mean": 2.9799968133,
      "stddev": 0.08831080976453795,
      "median": 2.9780374749,
      "user": 2.47132464,
      "system": 3.43976314,
      "min": 2.8299090384,
      "max": 3.1256024644,
      "times": [
        3.0483820684,
        2.9789185614,
        2.9307676064,
        2.9187984764,
        2.9771563884,
        3.0882474604,
        3.1256024644,
        2.9843966514,
        2.8299090384,
        2.9177894174
      ],
      "exit_codes": [
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0
      ]
    },
    {
      "command": "pnpm",
      "mean": 6.7077239796,
      "stddev": 0.07541291153144425,
      "median": 6.6887728799,
      "user": 9.28145574,
      "system": 4.581847639999999,
      "min": 6.6099420884,
      "max": 6.8111099794,
      "times": [
        6.6818056094000005,
        6.664050806400001,
        6.6519270294,
        6.695740150400001,
        6.809544804400001,
        6.7139850754,
        6.634682557400001,
        6.8111099794,
        6.6099420884,
        6.8044516954
      ],
      "exit_codes": [
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0,
        0
      ]
    }
  ]
}

zkochan added a commit that referenced this pull request Apr 24, 2026
Four comments, all actionable:

* `write_cas_file` now routes through a private
  `write_cas_file_unchecked` instead of `write_cas_file_prehashed`.
  The only change for the pnpm-facing API is internal: before this
  commit, every `write_cas_file` call on a debug/test build would
  re-hash its own buffer inside the `debug_assert_eq!` of
  `write_cas_file_prehashed`. That's cheap per-call but on an install
  hot path across a whole dep tree it showed up as measurable debug-
  build overhead. The unchecked helper skips the assert; the public
  `write_cas_file_prehashed` keeps it for the external-caller-with-
  its-own-hash case.

* Add two unit tests for `write_cas_file_prehashed`:
  - `write_cas_file_prehashed_parity_with_write_cas_file`: same bytes
    through both entry points land at the same CAFS path and produce
    the same on-disk contents, for both `executable=false` and
    `executable=true`.
  - `write_cas_file_prehashed_debug_asserts_hash_matches_buffer`:
    `#[should_panic]` locks in the foot-gun guard; gated on
    `debug_assertions` because the assert is compiled out in release.

* Replace the stale TODO above `package_integrity.to_string()`.
  The old comment talked about "Cloning here is less than desirable"
  and suggested `Arc::clone`, but after the streaming refactor this
  site no longer clones the `Integrity`; it serializes it to a
  `String` for the `index.db` key. The new comment reflects what the
  allocation is actually for and what would cut it (an
  `Arc<str>` field on `DownloadTarballToStore`) if a profile ever
  demanded it.

* Make tokio's `io-util` feature explicit at the workspace level.
  `AsyncWriteExt` / `AsyncReadExt` are feature-gated; we were
  getting them via whatever transitive crate happened to enable
  them (`reqwest`'s `stream` feature, most likely). That's fragile
  across dep-graph changes. Now declared as a first-class feature
  on the workspace tokio line alongside `rt`, `rt-multi-thread`,
  and `macros`.
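The explicit feature declaration from the last bullet would look roughly like this in the workspace Cargo.toml (the version requirement shown is assumed, not taken from the PR):

```toml
# workspace Cargo.toml (illustrative): io-util declared first-class so
# AsyncWriteExt/AsyncReadExt no longer depend on transitive enablement
[workspace.dependencies]
tokio = { version = "1", features = ["rt", "rt-multi-thread", "macros", "io-util"] }
```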
@zkochan zkochan requested a review from Copilot April 24, 2026 18:37
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.



Comment thread crates/tarball/src/lib.rs Outdated
Comment on lines +541 to +549
// the compressed bytes into an SHA-512 `IntegrityChecker`.
// Compared to the old "buffer-everything, hash, decompress"
// pipeline this removes one full pass over the compressed body
// (no separate `Integrity::check(&response)` call), never holds
// both the compressed and decompressed buffers at the same
// time, and lets decompression overlap with `await`s on the
// network. The blocking tar iterator still runs inside
// `spawn_blocking` below so per-file SHA-512 + CAFS writes
// stay off the reactor.

Copilot AI Apr 24, 2026


Grammar/accuracy nit in the comment: “teeing the compressed bytes into an SHA-512 IntegrityChecker” should be “a …”, and IntegrityChecker isn’t inherently SHA-512 (it follows the algorithms present in Integrity). Consider rewording to avoid implying a fixed hash algorithm.

Suggested change
// the compressed bytes into an SHA-512 `IntegrityChecker`.
// Compared to the old "buffer-everything, hash, decompress"
// pipeline this removes one full pass over the compressed body
// (no separate `Integrity::check(&response)` call), never holds
// both the compressed and decompressed buffers at the same
// time, and lets decompression overlap with `await`s on the
// network. The blocking tar iterator still runs inside
// `spawn_blocking` below so per-file SHA-512 + CAFS writes
// stay off the reactor.
// the compressed bytes into an `IntegrityChecker` for the
// expected integrity algorithms. Compared to the old
// "buffer-everything, hash, decompress" pipeline this removes
// one full pass over the compressed body (no separate
// `Integrity::check(&response)` call), never holds both the
// compressed and decompressed buffers at the same time, and
// lets decompression overlap with `await`s on the network. The
// blocking tar iterator still runs inside `spawn_blocking`
// below so per-file SHA-512 + CAFS writes stay off the reactor.

Copilot uses AI. Check for mistakes.
Comment thread crates/store-dir/src/cas_file.rs Outdated
Comment on lines +74 to +97
/// Same as [`write_cas_file`](Self::write_cas_file) but takes a
/// pre-computed SHA-512 so the caller can hash the bytes as they
/// arrive (e.g. chunk-by-chunk out of a tar entry) and skip the
/// separate `Sha512::digest(buffer)` pass inside this function.
///
/// The caller is responsible for ensuring `file_hash` is the
/// SHA-512 of `buffer`; passing a mismatched hash would corrupt
/// the CAFS (the blob would land at a path that advertises a
/// digest its contents don't satisfy). A `debug_assert` catches
/// that misuse in debug builds / tests; release builds trust the
/// caller so the write path stays single-pass.
pub fn write_cas_file_prehashed(
    &self,
    buffer: &[u8],
    file_hash: FileHash,
    executable: bool,
) -> Result<PathBuf, WriteCasFileError> {
    debug_assert_eq!(
        Sha512::digest(buffer),
        file_hash,
        "write_cas_file_prehashed called with a hash that is not Sha512(buffer); \
         passing a mismatched hash would silently corrupt the CAFS",
    );
    self.write_cas_file_unchecked(buffer, file_hash, executable)

Copilot AI Apr 24, 2026


write_cas_file_prehashed is safe to call but relies on the caller to provide the correct hash (the guard is only a debug_assert, so release builds will trust the input). Since a mismatch would place bytes under the wrong digest path, consider making the “trusted/unchecked” nature more explicit in the API (e.g., naming) to reduce accidental misuse by other internal callers.

@KSXGitHub
Contributor

KSXGitHub commented Apr 24, 2026

Integrated-Benchmark Report (Linux)

Scenario: Frozen Lockfile

Command       Mean [s]       Min [s]  Max [s]  Relative
pacquet@HEAD  2.983 ± 0.083  2.834    3.144    1.15 ± 0.04
pacquet@main  2.603 ± 0.060  2.535    2.706    1.00
pnpm          5.900 ± 0.068  5.738    6.003    2.27 ± 0.06

HEAD (this PR) is 1.15 times slower according to the latest report, I wonder what happened?

@zkochan
Member Author

zkochan commented Apr 24, 2026

Update: the streaming pipeline has been reverted in bab8b40. The PR now keeps only the two changes that unambiguously help.

What's left in the PR

  1. Per-entry hash interleave: StoreDir::write_cas_file_prehashed, plus the chunked read+hash loop in extract_tarball_entries. One pass over decompressed entry bytes instead of two.
  2. Compressed-side tee hash — still stream the response via bytes_stream() and feed each chunk into ssri::IntegrityChecker so the per-tarball SHA-512 is computed during the download instead of in a separate Integrity::check(&buffer) pass afterward.

Compressed bytes are still buffered fully before decompression; zune-inflate still runs sync inside spawn_blocking, exactly like pre-PR. Shape now matches pnpm's crypto.hash → gunzipSync ordering.
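That post-revert shape reduces to a three-step pipeline: buffer while tee-hashing, verify, then decompress. A minimal sketch, with `verify` and `inflate` as illustrative stand-ins for ssri verification and zune-inflate:

```rust
// Buffer the compressed body, verify integrity, then decompress; the
// integrity verdict is produced before gzip is ever touched.
fn fetch_shape(
    chunks: impl Iterator<Item = Vec<u8>>,
    verify: impl Fn(&[u8]) -> bool,
    inflate: impl Fn(&[u8]) -> Option<Vec<u8>>,
) -> Result<Vec<u8>, &'static str> {
    let mut compressed = Vec::new();
    for chunk in chunks {
        // the real loop also feeds each chunk to ssri::IntegrityChecker here
        compressed.extend_from_slice(&chunk);
    }
    if !verify(&compressed) {
        return Err("Checksum"); // integrity first, matching pnpm's ordering
    }
    // in the real code this step runs inside spawn_blocking
    inflate(&compressed).ok_or("DecodeGzip")
}
```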

Why I dropped the async gzip pipeline

Toxiproxy-injected latency runs (integrated-benchmark frozen-lockfile, 1352-snapshot fixture, local M3):

latency          main                async-compression (previous PR head)  narrow-revert (this PR head)
0 ms (loopback)  2.60 ± 0.06 s (CI)  2.98 ± 0.08 s (CI)                    expect parity — CI will re-run
50 ms            16.70 ± 0.78 s      18.94 ± 1.06 s                        16.54 ± 1.25 s
200 ms           22.37 ± 0.42 s      22.58 ± 0.55 s                        not measured

The async-compression version forced the gzip backend to flate2/miniz_oxide (the only gzip feature that crate ships), which is noticeably slower on decompress than zune-inflate on Apple Silicon / commodity x86 (cf. investigations/pacquet-macos-perf.md §7). At every latency tier I could reproduce, the decompressor regression dominated the network/decompress overlap gain. The gap shrank with rising latency but never flipped to a win — the break-even point is somewhere around 200 ms, past the typical CDN / same-continent range.

The narrow revert recovers parity with main on the 50 ms run (within noise) while keeping the two cheap wins. A block comment at the streaming site documents these numbers so the next person who reaches for async-compression doesn't have to re-derive the regression.

Follow-up

A separate small PR will teach integrated-benchmark a --latency-ms flag (auto-launch toxiproxy + attach a latency toxic) so every future streaming/fetching change gets realistic numbers in CI, not just 0 ms loopback.

zkochan added 7 commits April 24, 2026 23:03
Clarify the scope of the cardinal "match pnpm exactly" rule: it
governs observable behavior (CLI flags, defaults, error codes,
`.npmrc` semantics, lockfile format, CAS/store-index layout) and
not internal implementation detail. Pipeline topology, async vs.
sync, streaming vs. buffering, hash-tee vs. post-hoc hashing,
thread-pool shape, chunk sizes, and parallelism strategy may
diverge when the divergence delivers a measured perf win and
leaves observable outputs byte-identical.

Require any such divergence to carry a `Why we diverge:` note so
it stays auditable and reversible if upstream later adopts a
different approach.
Two perf fixes from investigations/pacquet-macos-perf.md that both
collapse redundant passes over the tarball bytes.

1. Stream the HTTP body through the gzip decoder (doc fix #4).
   Replace `response.bytes().await` + `decompress_gzip` (sync
   zune-inflate) + separate `Integrity::check(&buffer)` with a
   streaming pipeline: `bytes_stream()` teed through
   `ssri::IntegrityChecker` into `async_compression::tokio::write::
   GzipDecoder`. This removes one full pass over the compressed body
   (no separate `.check()`), never holds compressed + decompressed
   buffers at the same time, and lets decompression overlap with
   network awaits. Integrity is still verified end-to-end before any
   CAFS blob is written, so a mismatch aborts the install with the
   same `TarballError::Checksum` behavior as before.

2. Interleave SHA-512 with the per-entry read (doc fix #2). Add
   `StoreDir::write_cas_file_prehashed`; the tar extraction loop
   reads + hashes 64 KiB chunks in one pass instead of
   `read_to_end` followed by a separate `Sha512::digest(buffer)`
   inside `write_cas_file`. The buffer still has to exist (CAFS
   filename is hash-derived), but two passes over the decompressed
   entry bytes collapse into one. Mirrors pnpm's `parseTarball` +
   `addBufferToCafs` in `store/cafs/src/addFilesFromTarball.ts`.

Why we diverge from pnpm on (1): upstream
`fetching/tarball-fetcher/src/remoteTarballFetcher.ts` +
`worker/src/start.ts` + `store/cafs/src/addFilesFromTarball.ts`
buffers the whole tarball into a Node `SharedArrayBuffer` and runs
`crypto.hash` + `zlib.gunzipSync` sequentially. Observable behavior
here is identical — same integrity errors on mismatch, same
decompressed bytes, same CAFS layout, same `index.db` rows — so per
the "Internal performance divergence is allowed" clause added in the
preceding commit, the streaming topology is a pacquet-specific
implementation-detail optimization rather than a shape port.

Dep changes: swap `zune-inflate` for `async-compression` (gzip +
tokio features) + `futures-util`; enable reqwest's `stream` feature;
add `sha2` to the tarball crate.

Tests: existing `packages_under_orgs_should_work` (real-network
end-to-end) and `should_throw_error_on_checksum_mismatch` both pass
unchanged, proving the streaming pipeline produces the same
decompressed bytes and fails integrity checks the same way.
An earlier commit in this PR routed the response body through
`async_compression::tokio::write::GzipDecoder` so network fetch and
gzip inflation could overlap. That forced the gzip backend to
`flate2`/`miniz_oxide` (the only gzip feature async-compression
offers) and turned out to regress wall-time at every latency tier
I could reproduce locally.

Toxiproxy-injected latency, integrated-benchmark frozen-lockfile
scenario, 1352-snapshot fixture, M3 / APFS:

    0 ms (loopback)  main 2.60 s  vs  streaming 2.98 s  (15% slower)
    50 ms (typical)  main 16.70 s vs  streaming 18.94 s (13% slower)
    200 ms (WAN)     main 22.37 s vs  streaming 22.58 s (break-even)

The overlap gain does scale with latency, but it never overtakes
the decompressor regression at any realistic latency — `miniz_oxide`
is slow enough on the decompress side that matching pnpm upstream's
"buffer body, then `gunzipSync`" topology wins on wall-time with the
current decompressor set.

Revert to:

* `bytes_stream()` drained chunk-by-chunk so we can tee each chunk
  through the `IntegrityChecker` — this is the one piece of the
  streaming pipeline that pays for itself. Saves one full pass over
  the compressed body vs the pre-PR `response.bytes().await` +
  `package_integrity.check(&response)`.
* Buffer the compressed bytes fully, then decompress with
  `zune-inflate` sync inside `spawn_blocking`, exactly as the
  pre-PR code did.
* Integrity is verified before we touch gzip, matching pnpm's
  `crypto.hash` → `gunzipSync` ordering.

Keeps the per-entry hash interleave from the earlier commit (that
one is unambiguously a win and scales with file count, not latency).

Removes `async-compression` (workspace + crate dep) and tokio's
`io-util` feature (no other caller in the workspace); restores
`zune-inflate` as the decompressor.

Code change includes a block comment at the streaming site that
cites these numbers so the next person who reaches for
async-compression doesn't have to re-derive the regression.

Hyperfine runs its measured commands sequentially (no per-iteration
interleaving). With `--warmup 1`, the second-listed command's
measured runs always start from a state that the first command's 10
measured runs already warmed — OS page cache, verdaccio's on-disk
cache, registry-mock in-memory state, inode cache, everything.

This shows up as a systematic 1-3% advantage for whichever command
is listed second in every PR's benchmark run, regardless of the
code difference. #278's CI surfaced this clearly: three successive
rebases all reported main slightly ahead of the PR HEAD (1.03× →
1.15× range depending on how much real signal was in the change),
and it replicated even for the current commit where the diff is
tiny. The workflow passes `HEAD main`, so `main` is always the
second command, and `main` always appears slightly faster.

Fix it by running each benchmarked install once in a `prewarm`
phase before hyperfine starts. After prewarm, OS + verdaccio +
registry-mock are all fully primed, and hyperfine's `--warmup 1`
per command is just the final touch rather than the sole source of
state warmup. Both commands' measured runs then start from the
same warm state, and any remaining difference reflects the code
under test.

Cost: one extra install per revision (plus pnpm if
`--with-pnpm`). At ~2.6s each on CI, that's ≤10s of added bench
time for a three-way compare — cheap for reliable numbers.

Refactors `cleanup_command()` out of `benchmark()` so `prewarm()`
and `benchmark()` share the exact same rm-rf between iterations,
keeping the state machine symmetric.
zkochan added 4 commits April 24, 2026 23:58
Temporary diagnostic for #278. The last two CI bench runs showed
`main` ~3-4% ahead of `HEAD`, but the prewarm pass from the
previous commit should have eliminated cold-cache ordering bias —
yet the ratio was essentially unchanged. That leaves two
competing hypotheses:

* systematic benchmark-framework bias favouring the second-listed
  hyperfine command (CPU scheduling, measurement overhead, runner
  variance), or
* a real ~3-4% regression in the remaining PR changes.

Flip the argument order from `HEAD main` to `main HEAD` so `main`
is now the first-listed command. If the next CI run shows `HEAD`
winning by a similar margin, it's framework bias and we'll follow
up with a proper randomization. If `main` still wins, the PR has
real overhead and we revert the per-entry hash interleave.

Revert this diff once the signal is unambiguous.

Fix #2 from `investigations/pacquet-macos-perf.md` turned out to
regress wall-time at 0ms loopback by ~2% across three successive
CI runs on #278, even after controlling for hyperfine-ordering
bias (the previous commit swapped `HEAD main` → `main HEAD`; main
stayed ahead by 2%, so the bias hypothesis explains only ~half the
gap). Fix #2 is the most likely culprit — it hits every tar entry
(tens of thousands across the 1352-snapshot fixture) and replaces
an optimized `Vec::read_to_end` + single `Sha512::digest` with a
chunked read loop that does per-chunk `try_reserve`,
`hasher.update`, and `extend_from_slice`. The streaming
per-chunk branches add overhead that the saved second pass doesn't
recover at the scale of small npm entries (typical entry body
is a few KiB — way below where a second scan of the already-hot
cache starts mattering).

Revert to the pre-PR shape on the entry loop:

    let mut buffer = Vec::new();
    buffer.try_reserve(prealloc_hint).map_err(...)?;
    entry.read_to_end(&mut buffer).map_err(...)?;
    let (file_path, file_hash) = store_dir
        .write_cas_file(&buffer, file_is_executable)
        .map_err(...)?;

Remove `write_cas_file_prehashed` and the private
`write_cas_file_unchecked` helper from `StoreDir` — nothing calls
them now. Also drop the two prehashed-specific tests
(`…_parity_with_write_cas_file` and
`…_debug_asserts_hash_matches_buffer`) and the tarball crate's
`sha2` dep, since the one-shot `write_cas_file` on
`pacquet-store-dir` is the only consumer of SHA-512 once more.

What's still left in the PR after this revert:

* compressed-side tee-hash: `bytes_stream()` + per-chunk
  `IntegrityChecker::input` eliminates the separate
  `Integrity::check(&buffer)` pass after buffering. Chunk count
  here is tiny (a handful per tarball) so per-chunk overhead is
  negligible, and the payoff (skipping one full SHA-512 over the
  compressed body) scales with tarball size.
* integrity check ordering before decompression, matching pnpm's
  `crypto.hash` → `gunzipSync` sequence.
* AGENTS.md's "internal perf divergence is allowed" clause.

Next CI bench will tell us whether the remaining tee-hash alone
is clean or whether the regression persists and the PR should
close.

Reverts 2c1e83b. That commit flipped the argument order as a
one-off probe to disambiguate hyperfine ordering bias from real
regression on #278 — it served its diagnostic purpose (confirmed
~1-2% second-command-wins bias plus ~2% real overhead in the
since-reverted per-entry hash interleave) and should not land on
main. Put the workflow back to `HEAD main` so historical bench
comparisons across PRs stay apples-to-apples.

Reverts 0f46687. The prewarm phase was added on this PR to test
whether cold-cache ordering was driving the "main is a bit faster
in every PR" pattern. It wasn't — the pre-warmed run showed the
same 1.04× ratio as without. The ordering bias we eventually
pinned down (~1-2% advantage for whichever command hyperfine runs
second) lives somewhere in hyperfine's own plumbing, not in the
shared state the prewarm covered.

Since the prewarm didn't solve what it was added for and this
PR's scope is a tarball-side perf fix, drop it. A separate PR can
tackle the framework bias properly (randomizing order, averaging
over multiple invocations, or switching to a tool that
interleaves) without piggybacking on an unrelated review.
@zkochan zkochan changed the title perf(tarball): stream HTTP body into gzip; interleave per-entry hash perf(tarball): tee integrity hash on the download stream Apr 24, 2026
zkochan added a commit to kdy1/pacquet that referenced this pull request Apr 24, 2026
The previous commit on this branch picked `swc_malloc`, a
meta-crate that selects mimalloc on macOS / Windows and jemalloc
on Linux at compile time. The first CI integrated-benchmark run
(pnpm#188 at 395a9d2) came back with `pacquet@HEAD` 2% slower than
`pacquet@main` — a 57 ms gap inside noise bands (ratio 1.02 ±
0.04), which is consistent with the 1-2% "second-command wins"
ordering bias we measured separately on pnpm#278.

That reading tells us jemalloc on this Linux runner (glibc 2.35+)
gives us nothing measurable over the system allocator. Before
closing pnpm#188, run one more comparison with mimalloc instead of
jemalloc — mimalloc has a different allocator philosophy
(per-thread free lists + small-object fast path) and is a
stronger fit for the "thousands of short-lived small
allocations" shape a package manager produces.

Diff is minimal:
  * `swc_malloc` → `mimalloc` in workspace + `pacquet-cli` deps.
  * `extern crate swc_malloc;` → explicit
    `#[global_allocator] static GLOBAL: mimalloc::MiMalloc =
    mimalloc::MiMalloc;` in `crates/cli/src/lib.rs`.
  * Comment rewritten to describe what this allocator actually is
    rather than the per-target selection swc_malloc was doing.

If the next CI bench shows mimalloc also within noise of the
system allocator, this PR closes — the allocator swap doesn't
deliver on the Linux runner we measure against. If mimalloc
wins, we land the direct-mimalloc form.

Co-authored-by: 강동윤 (Donny) <kdy1997.dev@gmail.com>
@zkochan zkochan mentioned this pull request Apr 24, 2026
@zkochan
Member Author

zkochan commented Apr 24, 2026

It doesn't seem like any of these changes improve performance. I am closing it.

@zkochan zkochan closed this Apr 24, 2026