Conversation
Ooh, this will have to be rebased on top of #1. I am happy to do that! But since it will involve changing stuff sort of in the middle of the series, I'm going to wait until you review what's here, and then if you have any feedback, I can handle both the rebase and the feedback at the same time.

(After reading over that code, I have some stuff that will make your life easier over there too. But again, I'll wait for that.)
anonrig left a comment:

Amazing work, thank you for your feedback. I had some questions, but I'm very happy to see this. One of my last questions: why does the usage of .context only exist in the cli crate and not in the other crates? Should they share the same behavior?
I replied to a few things but I have to run; I'll put more later and update the PR as well.

Left more comments. I am going to rebase and add more commentary on the stuff from #1 tomorrow.

I love open source!
Previously, this code said "Hey, I will take anything that can be
converted into a &Path," and now it says "Hey, I can take anything that
can be turned into a PathBuf."

The former is not *wrong*, but it is a bit roundabout. You want to store
a PathBuf, so just take anything that can be turned into a PathBuf. Not
only is this a little simpler, but it can also save you an allocation,
at least conceptually; I don't know if rustc is smart enough to remove
the allocation in code like this:
```rust
let path = PathBuf::from("/etc/passwd");
let manager = RegistryManager::new(&path);
```
With the previous version of the code, this would convert a PathBuf into
a &Path by taking a pointer to it, and then create a whole new PathBuf
out of it by allocating. This way, we can still accept the same inputs,
but if we already have a PathBuf, we skip the round trip through a
pointer. In other words, you could pass `path` directly in the code
above, whereas you couldn't with the original version.
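To make the difference concrete, here's a minimal sketch of the two signatures; the struct body and method names are illustrative, not the project's actual definitions:

```rust
use std::path::{Path, PathBuf};

struct RegistryManager {
    path: PathBuf,
}

impl RegistryManager {
    // The previous shape (sketch): borrow as &Path, then allocate a
    // fresh PathBuf from that borrow, even if the caller had one.
    fn new_borrowing(path: impl AsRef<Path>) -> Self {
        Self { path: path.as_ref().to_path_buf() }
    }

    // The new shape (sketch): take anything convertible into a PathBuf;
    // an owned PathBuf moves straight in with no new allocation.
    fn new(path: impl Into<PathBuf>) -> Self {
        Self { path: path.into() }
    }
}
```

With the new shape, `RegistryManager::new(path)` consumes the PathBuf directly, and `RegistryManager::new("/etc/passwd")` still works because `&str` converts into `PathBuf`.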
The *only* thing &String can meaningfully do over &str is examine the capacity of the underlying string, and you aren't doing that. This signature is more general, and more idiomatic.
This code works via a feature called "Deref coercion." Basically: if T implements Deref<Target = U>, then a &T will be coerced into a &U where a &U is expected. This means we can pass a &String to a function expecting a &str, and it will Just Work, because String implements Deref<Target = str>. This is also why the previous commit works and is more general; maybe I should have said this there!
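A tiny self-contained illustration of that coercion:

```rust
// Takes the more general &str; callers holding a String can still use it.
fn shout(s: &str) -> String {
    s.to_uppercase()
}

fn main() {
    let owned = String::from("hello");
    // &String coerces to &str because String: Deref<Target = str>.
    println!("{}", shout(&owned));
    // A string literal is already a &str, so it works directly too.
    println!("{}", shout("world"));
}
```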
Format strings are a relatively new addition to Rust, so that's probably why you didn't find them. Also, when creating a String from a string literal, the standard style is to use .to_string, and there's a pretty significant minority style (which I personally prefer) of using String::from. But given that you're doing concatenation, this is what I would personally consider idiomatic Rust post-1.58.
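For example, here's the post-1.58 captured-identifier form alongside the older styles:

```rust
fn main() {
    let name = "pacquet";
    // Concatenation: works, but gets noisy as the pieces multiply.
    let a = "hello ".to_string() + name;
    // Pre-1.58 format string: positional placeholder.
    let b = format!("hello {}", name);
    // Post-1.58: the identifier is captured right inside the braces.
    let c = format!("hello {name}");
    assert_eq!(a, b);
    assert_eq!(b, c);
}
```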
When working with filesystems, especially, code that vaguely looks like
this:
```
if !a_file.exists() {
    do_something_to_create_file()
}
```
has a bug, of varying degrees of severity. This is called "Time-of-check
to time-of-use," and the problem is roughly:
```
if !a_file.exists() {
    // What happens if we get here, but before
    // do_something_to_create_file is called, we
    // swap to another thread, and another process creates
    // a_file, possibly by doing something fancy with symlinks?
    do_something_to_create_file()
}
```
I don't think there's an actual problem here, but it's better just to
"confidently" call these functions rather than try to get fancy. That
way the creation is "atomic," if you will.
For more: https://en.wikipedia.org/wiki/Time-of-check_to_time-of-use
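For directories, for example, the "confident" call looks like this minimal sketch (the path is illustrative):

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    // No exists() check first: create_dir_all succeeds whether or not
    // the directory is already there, so there's no window between the
    // check and the use for another process to sneak into.
    fs::create_dir_all("cache/packages")?;
    Ok(())
}
```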
Now, the commit line says "partially" because
```
if !extract_destination.exists() {
let _ = download_and_extract(
```
This is doing the same thing, but modifying all of that code to also be
okay with being called this way is a larger lift, and you don't have any
tests, so I am nervous about breaking something :)
You're already returning Result, so it's best not to panic. I decided to try to keep this in the same style you currently have, with a String living in the error type, but you could also just stash the io::Error in there too. Also, I didn't make this change in this diff because it is up to you, but I would not make a new module for the error; I would put it in lib.rs. You're not really buying anything by moving it in there, just making yourself import it when you don't have to. Up to you though!
You were even mapping these to a Result, only to call unwrap on them!
You know the drill by now :)
As your application grows here, I think you'll come to appreciate
anyhow. A common suggestion is "thiserror for libraries, anyhow for
applications," and so we're following that. You're effectively putting
main in the library, which isn't bad, but it's why this is not
*strictly* following the letter of that advice, but instead, the spirit.
.context will give you output that looks like this:
```
Error: current directory should exist
Caused by:
No such file or directory (os error 2)
```
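For reference, a minimal sketch of the call that produces output in that shape, using the message from the example above:

```rust
use anyhow::Context;

fn main() -> anyhow::Result<()> {
    // .context attaches the human-readable line; the underlying
    // io::Error becomes the "Caused by:" entry in anyhow's report.
    let cwd = std::env::current_dir().context("current directory should exist")?;
    println!("{}", cwd.display());
    Ok(())
}
```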
I also tweaked the error messages, since they assumed that the directory
not existing is the only possible issue. For example, maybe the problem
is permissions, in which case the error would look like this:
```
Error: current directory should exist
Caused by:
Permission denied (os error 13)
```
which is uh, wrong. So as of this commit, this will instead say
```
Error: problem fetching current directory
Caused by:
Permission denied (os error 13)
```
or
```
Error: problem fetching current directory
Caused by:
No such file or directory (os error 2)
```
The other two messages are already pretty good, I'd say.
I think this looks nicer. 100% subjective, of course :)
Okay, so this is, in my mind, slightly better error handling. Instead of erasing everything to a String, you're actually holding onto the various error types. This would allow your users to handle errors in a more granular way if they wished. I struggled a bit with naming here. Network IO is still IO, so these names feel a bit redundant. However, I suspect separating the two does make sense, as what you'd do to fix them is different. And merging them into one grand IO error is actually a huge hassle and not helpful for users. So I kept it this way. It's up to taste, really.
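A hedged sketch of that shape; the enum and variant names are illustrative, not the crate's exact definitions:

```rust
use thiserror::Error;

#[derive(Error, Debug)]
pub enum TarballError {
    // The underlying error types are kept, not flattened into a String,
    // so callers can match on them and react differently to each.
    #[error("network error while fetching the tarball")]
    Network(#[source] reqwest::Error),
    #[error("io error while writing the tarball")]
    Io(#[source] std::io::Error),
}
```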
As written, this would look stuff up twice; instead, let's do that once.
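The general pattern, as a sketch (the map and key are stand-ins for whatever the real code consults):

```rust
use std::collections::HashMap;

fn main() {
    let map = HashMap::from([("serde", "1.0")]);
    // Two lookups: contains_key hashes and walks the map, then get
    // does the exact same work again.
    if map.contains_key("serde") {
        println!("{}", map.get("serde").unwrap());
    }
    // One lookup: get already tells us whether the key was present.
    if let Some(version) = map.get("serde") {
        println!("{version}");
    }
}
```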
We don't need an owned version of this to call this method.
Given that you're exporting this enum, you have a choice to make:

1. never add another variant unless you want to increase your semver major version, or
2. make it a non-exhaustive enum.

The latter is a good idea for errors, generally.
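The attribute itself is one line on the sketch enum from above:

```rust
// #[non_exhaustive] means downstream crates must include a wildcard arm
// when matching, so adding a variant later is not a breaking change.
#[non_exhaustive]
pub enum TarballError {
    Network(reqwest::Error), // variants illustrative, as before
    Io(std::io::Error),
}
```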
So, because we do some work in the conversion of these error messages, *technically* this way is slightly more efficient. You see, with `or`, we unconditionally run the "generate the error" code. But with `or_else`, we only invoke that code if we actually hit the error. So in effect, this moves the allocation to the error path only, which is nice.
Turns out that now that we're using `or_else`, clippy recognizes that there's a better method on Result, `map_err`, for this task.
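The progression in miniature, with a hypothetical parse as the fallible step:

```rust
fn read_port(s: &str) -> Result<u16, String> {
    // Step 1: `or` builds the error String eagerly, on every call:
    //     s.parse().or(Err(format!("bad port: {s}")))
    // Step 2: `or_else` defers the allocation to the error path:
    //     s.parse().or_else(|_| Err(format!("bad port: {s}")))
    // Step 3: clippy points out map_err is the purpose-built method:
    s.parse().map_err(|_| format!("bad port: {s}"))
}

fn main() {
    assert_eq!(read_port("8080"), Ok(8080));
    assert!(read_port("not-a-port").is_err());
}
```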
I really prefer using at least unwrap here. If this never happens, they're equivalent, but if it does happen, with the old code, you silently continue, and with the new code, you blow up.
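In sketch form (the file name is a stand-in):

```rust
use std::fs;

fn main() {
    // Old shape: the Result is discarded, so on failure we silently
    // continue as if the write had happened.
    let _ = fs::write("registry-cache.json", "{}");
    // New shape: identical when it succeeds, but a loud panic carrying
    // the io::Error when it doesn't.
    fs::write("registry-cache.json", "{}").unwrap();
}
```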
Force-pushed from 2643ef5 to 9387919.
Now that we've implemented From, this is no longer needed. Nice catch @anonrig!
I have rebased and made sure each commit builds and passes.
They should not. The answer is kind of long and philosophical, sorry about that. context is in anyhow: "anyhow for binaries, thiserror for libraries". But why is that the advice? Binaries tend to catch errors, and libraries tend to produce errors. Everything else derives from this.

As a library, it's our job to sort out the error API and then present exactly that. So you figure out the broader structure of your errors, and then internally convert things into those errors where appropriate, and return them. We're passing these things around, so we care about efficiency. This is what thiserror is for.

As a binary, we take those errors, but generally either do something right away, or throw the error upstream until we print out some information about it. So we tend to care about whether we can do something other than blow up at an error, and, if we do decide to blow up, about a way to figure out where the error came from within our application's code. Also, because we are blowing up, we don't really care about efficiency; this happens once and is a cold path. This is what anyhow is for.

Does all of that make sense? It's a vibe thing on some level, for sure.
thiserror provides a way to generate from implementations. This commit demonstrates using this feature. Nice catch, @weihanglo!
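Concretely, swapping #[source] for #[from] on the sketch enum from earlier generates the From impls, so `?` converts automatically (names still illustrative):

```rust
use thiserror::Error;

#[derive(Error, Debug)]
#[non_exhaustive]
pub enum TarballError {
    #[error("network error while fetching the tarball")]
    Network(#[from] reqwest::Error),
    #[error("io error while writing the tarball")]
    Io(#[from] std::io::Error),
}

// With From derived, `?` does the conversion; no manual map_err needed.
fn read_cached(path: &std::path::Path) -> Result<Vec<u8>, TarballError> {
    Ok(std::fs::read(path)?) // io::Error -> TarballError::Io via From
}
```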
Force-pushed from 521816a to f8f2d93.
I do not know how you want to configure the tool, so if you could either add it or let me know, that would get this green and ready to go. :)

I think this is ready to go! Thanks!

From a Rust newbie, this was enlightening; great work @steveklabnik! 😁 Also congrats to @anonrig on the project. 👍

Echoing what others said! This PR and its explanations were great, @steveklabnik, and thanks @anonrig for sharing the repo so it could happen here! I did a stream today where I walked through the PR and discussed the changes in each commit. Basically just a video walkthrough of this PR, if anyone is interested:
Two perf fixes from investigations/pacquet-macos-perf.md that both collapse redundant passes over the tarball bytes.

1. Stream the HTTP body through the gzip decoder (doc fix #4). Replace `response.bytes().await` + `decompress_gzip` (sync zune-inflate) + separate `Integrity::check(&buffer)` with a streaming pipeline: `bytes_stream()` teed through `ssri::IntegrityChecker` into `async_compression::tokio::write::GzipDecoder`. This removes one full pass over the compressed body (no separate `.check()`), never holds compressed + decompressed buffers at the same time, and lets decompression overlap with network awaits. Integrity is still verified end-to-end before any CAFS blob is written, so a mismatch aborts the install with the same `TarballError::Checksum` behavior as before. (A sketch of this topology follows below.)

2. Interleave SHA-512 with the per-entry read (doc fix #2). Add `StoreDir::write_cas_file_prehashed`; the tar extraction loop reads + hashes 64 KiB chunks in one pass instead of `read_to_end` followed by a separate `Sha512::digest(buffer)` inside `write_cas_file`. The buffer still has to exist (the CAFS filename is hash-derived), but two passes over the decompressed entry bytes collapse into one. Mirrors pnpm's `parseTarball` + `addBufferToCafs` in `store/cafs/src/addFilesFromTarball.ts`.

Why we diverge from pnpm on (1): upstream `fetching/tarball-fetcher/src/remoteTarballFetcher.ts` + `worker/src/start.ts` + `store/cafs/src/addFilesFromTarball.ts` buffers the whole tarball into a Node `SharedArrayBuffer` and runs `crypto.hash` + `zlib.gunzipSync` sequentially. Observable behavior here is identical (same integrity errors on mismatch, same decompressed bytes, same CAFS layout, same `index.db` rows), so per the "Internal performance divergence is allowed" clause added in the preceding commit, the streaming topology is a pacquet-specific implementation-detail optimization rather than a shape port.

Dep changes: swap `zune-inflate` for `async-compression` (gzip + tokio features) + `futures-util`; enable reqwest's `stream` feature; add `sha2` to the tarball crate.

Tests: existing `packages_under_orgs_should_work` (real-network end-to-end) and `should_throw_error_on_checksum_mismatch` both pass unchanged, proving the streaming pipeline produces the same decompressed bytes and fails integrity checks the same way.
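A hedged sketch of the tee topology described in (1); the function shape and names are illustrative, not pacquet's actual code, and it assumes reqwest's `stream` feature is enabled:

```rust
use async_compression::tokio::write::GzipDecoder;
use futures_util::StreamExt;
use tokio::io::AsyncWriteExt;

async fn fetch_and_verify(
    response: reqwest::Response,
    integrity: ssri::Integrity,
) -> Result<Vec<u8>, Box<dyn std::error::Error>> {
    let mut checker = ssri::IntegrityChecker::new(integrity);
    let mut decoder = GzipDecoder::new(Vec::new()); // decompressed bytes land here
    let mut body = response.bytes_stream();
    while let Some(chunk) = body.next().await {
        let chunk = chunk?;
        checker.input(&chunk);            // hash the compressed bytes...
        decoder.write_all(&chunk).await?; // ...and decompress, in one pass
    }
    decoder.shutdown().await?;
    checker.result()?; // integrity verified before anything touches the store
    Ok(decoder.into_inner())
}
```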
Fix #2 from `investigations/pacquet-macos-perf.md` turned out to regress wall-time at 0ms loopback by ~2% across three successive CI runs on #278, even after controlling for hyperfine-ordering bias (the previous commit swapped `HEAD main` → `main HEAD`; main stayed ahead by 2%, so the bias hypothesis explains only about half the gap).

Fix #2 is the most likely culprit: it hits every tar entry (tens of thousands across the 1352-snapshot fixture) and replaces an optimized `Vec::read_to_end` + single `Sha512::digest` with a chunked read loop that does per-chunk `try_reserve`, `hasher.update`, and `extend_from_slice`. The streaming per-chunk branches add overhead that the saved second pass doesn't recover at the scale of small npm entries (a typical entry body is a few KiB, way below where a second scan of the already-hot cache starts mattering).

Revert to the pre-PR shape on the entry loop:

```
let mut buffer = Vec::new();
buffer.try_reserve(prealloc_hint).map_err(...)?;
entry.read_to_end(&mut buffer).map_err(...)?;
let (file_path, file_hash) = store_dir
    .write_cas_file(&buffer, file_is_executable)
    .map_err(...)?;
```

Remove `write_cas_file_prehashed` and the private `write_cas_file_unchecked` helper from `StoreDir`; nothing calls them now. Also drop the two prehashed-specific tests (`…_parity_with_write_cas_file` and `…_debug_asserts_hash_matches_buffer`) and the tarball crate's `sha2` dep, since the one-shot `write_cas_file` on `pacquet-store-dir` is once again the only consumer of SHA-512.

What's still left in the PR after this revert:

* compressed-side tee-hash: `bytes_stream()` + per-chunk `IntegrityChecker::input` eliminates the separate `Integrity::check(&buffer)` pass after buffering. The chunk count here is tiny (a handful per tarball), so per-chunk overhead is negligible, and the payoff (skipping one full SHA-512 over the compressed body) scales with tarball size.
* integrity check ordering before decompression, matching pnpm's `crypto.hash` → `gunzipSync` sequence.
* AGENTS.md's "internal perf divergence is allowed" clause.

The next CI bench will tell us whether the remaining tee-hash alone is clean or whether the regression persists and the PR should close.
- Extract a `snapshot_cache_key` helper for the `(integrity, pkg_id)` → cache-key derivation. Both call sites (the upfront `cache_keys` build and the warm/cold partition loop) now go through the same helper, so a future change to the resolution-type handling or key shape stays in one place. A drift between the two loops would silently misclassify warm entries as cold and quietly halve install speed.
- Document the intentional `.max(4)` floor in `configure_rayon_pool` (see the sketch below). Even on a quota-limited 1-2-CPU runner, dropping below 4 puts rayon back into the regime where one thread blocks on `clonefile` while the next ready snapshot can't start; the metadata-journal bottleneck doesn't shrink with quota, so a small intentional oversubscription is the better trade. Both the function-level doc and an inline comment by the `.max(4)` call explain the choice.
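As promised in the second bullet, a hedged sketch of the floor; the real `configure_rayon_pool` may differ in its details:

```rust
// Sketch only: assumes rayon's global pool is configured once at startup.
fn configure_rayon_pool() -> Result<(), rayon::ThreadPoolBuildError> {
    let threads = std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1)
        // Floor of 4: even on a 1-2 CPU quota, one thread blocked on
        // clonefile must not prevent the next ready snapshot from starting.
        .max(4);
    rayon::ThreadPoolBuilder::new()
        .num_threads(threads)
        .build_global()
}
```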
#1 (parse_shebang args leading whitespace): `splitn(2, [' ', '\t'])` discarded the whitespace separator, so an upstream-format `#!/bin/sh -e` parsed to `args="-e"`, and the rendered shim text emitted one space between prog and the first flag where upstream emits two. Replace `splitn` with a manual `find([' ', '\t'])` + `split_at` that keeps the separator in the args slice, matching upstream's regex group `(.*)$` (see the sketch below). Update `parses_direct_shebang` and `parses_env_dash_s_shebang` to assert the leading-space form so the regression direction is pinned.

#2 (idempotent skip checked only `.sh` but skipped writes for all three): if a previous install wrote `.sh` correctly but `.cmd`/`.ps1` were missing (older pacquet, partial-write crash), `already_correct` short-circuited and the upgrade path never repaired the siblings. Gate `already_correct` on the `.sh` marker AND the existence of both `.cmd` and `.ps1`. Add `link_bins_rewrites_when_only_sh_flavor_exists`, which deletes the two siblings and verifies that a second `link_bins` pass re-creates them. Verified by temporarily re-breaking the gate and confirming the new test fails, then reverting.

Findings 3, 4, 5, 9 are pending review acceptance; they'll land in a follow-up commit.
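A hedged sketch of the separator-preserving split for #1; the helper name and surrounding shape are illustrative:

```rust
// Split "<prog><ws><args>" so args keep their leading whitespace,
// mirroring the upstream regex group `(.*)$`.
fn split_shebang(rest: &str) -> (&str, &str) {
    match rest.find([' ', '\t']) {
        Some(idx) => rest.split_at(idx), // separator stays in the args slice
        None => (rest, ""),
    }
}

fn main() {
    // "#!/bin/sh -e": prog "/bin/sh", args " -e" (leading space preserved,
    // so the rendered shim emits two spaces, matching upstream).
    assert_eq!(split_shebang("/bin/sh -e"), ("/bin/sh", " -e"));
}
```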
Okay! So! Here I am!
The very first thing I want to say is, this code was already really great. If you were a random beginner who asked me to look at their code, I would say "great job" and not offer any of this feedback unless they were looking to actively level up at Rust. And while we don't know each other well, I do know that you're an experienced systems programmer, and looking to level up your Rust.
So I went a little extra. Sorry not sorry.
Each of these commits tells a story, in order. I left detailed commit messages. Some code gets re-written multiple times, but each step should be instructive, so if you only look at the final diff, you won't learn everything. Please take what you want, and throw away what you don't. And I am happy to discuss anything about any of this in detail.
This looks like a fun project!