perf: Stream file contents during hashing to lower memory usage #12059
Merged
Both hash_file (gix path) and git_like_hash_file (manual fallback) previously read entire files into memory before hashing. For large files hashed in parallel on rayon, this could cause OOM on memory-constrained CI runners. Now both paths stat the file for its size, write the git blob header, then stream through a 64KB BufReader. Peak memory per hash call is bounded regardless of file size. Hash output is identical — verified against git hash-object.
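A minimal sketch of the streaming shape described above, assuming a hypothetical helper named `stream_blob` and a caller-supplied `update` callback that stands in for the hasher's update method. It is not the code from this PR; it only illustrates the order of operations (header first, then bounded chunks).

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Buffer size mirroring the 64KB figure from the description.
const CHUNK_SIZE: usize = 64 * 1024;

/// Feeds the git blob framing ("blob <size>\0" followed by the content) to
/// `update`, one bounded piece at a time. `update` is a stand-in for a
/// hasher's update method (an assumption made for this sketch).
/// `size` must be the exact content length, since it is hashed as part of
/// the header. Returns the number of content bytes actually streamed.
fn stream_blob<R: Read>(
    reader: R,
    size: u64,
    mut update: impl FnMut(&[u8]),
) -> io::Result<u64> {
    // The header goes in before any content.
    update(format!("blob {size}\0").as_bytes());

    let mut reader = BufReader::with_capacity(CHUNK_SIZE, reader);
    let mut streamed = 0u64;
    loop {
        // Borrow at most CHUNK_SIZE bytes from the reader's internal buffer.
        let buf = match reader.fill_buf() {
            Ok(buf) => buf,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        if buf.is_empty() {
            break; // end of input
        }
        let n = buf.len();
        update(buf);
        reader.consume(n);
        streamed += n as u64;
    }
    Ok(streamed)
}
```

In the manual path, the callback would presumably forward each piece to the `sha1::Sha1` hasher. Returning the streamed byte count is a choice made for this sketch: it would let a caller notice a file that changed size between the stat and the read. Whether the PR performs such a check is not stated.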
- Propagate metadata error with `?` instead of silently falling back to size 0, which would produce an incorrect blob hash header.
- Use `std::fs::write` in test to correctly write binary content instead of silently writing empty files via `str::from_utf8` fallback.
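To illustrate why the size-0 fallback matters, here is a small sketch contrasting the two shapes. Both function names (`blob_header`, `blob_header_lossy`) are hypothetical, and the lossy variant is reconstructed for contrast rather than copied from the earlier code.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Builds the blob header, propagating a failed stat to the caller with `?`.
fn blob_header(path: &Path) -> io::Result<Vec<u8>> {
    let size = fs::metadata(path)?.len();
    Ok(format!("blob {size}\0").into_bytes())
}

/// Silent-fallback shape (illustrative only): a failed stat turns into
/// size 0, so the header claims an empty blob whatever the file contains.
fn blob_header_lossy(path: &Path) -> Vec<u8> {
    let size = fs::metadata(path).map(|m| m.len()).unwrap_or(0);
    format!("blob {size}\0").into_bytes()
}
```

Because the size is hashed as part of the header, a wrong size yields a well-formed but incorrect hash, which is harder to notice than an error.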
github-actions bot added a commit that referenced this pull request on Feb 28, 2026
## Release v2.8.13-canary.8

Versioned docs: https://v2-8-13-canary-8.turborepo.dev

### Changes

- fix: Exclude peer dependencies from workspace external dep resolution (#12050) (`3a75547`)
- test: Port all 15 workspace-configs prysk tests to Rust (#12058) (`55442be`)
- release(turborepo): 2.8.13-canary.7 (#12060) (`495afdc`)
- perf: Stream file contents during hashing to lower memory usage (#12059) (`f03cdce`)
- fix: Treat `npm: alias` dependencies as external, not workspace references (#12061) (`b179cb8`)
- test: Port 18 more prysk tests to Rust (other/ + lockfile-aware-caching/) (#12062) (`7887af2`)

---------

Co-authored-by: Turbobot <turbobot@vercel.com>
## Summary

- `hash_file` (gix path) and `git_like_hash_file` (manual fallback) previously called `std::fs::read()` / `read_to_end()`, loading entire files into memory before hashing. When rayon parallelizes hashing across many large files, this can OOM memory-constrained environments.
- Both paths now stat the file for its size, write the git blob header, then stream through a 64KB `BufReader`. Peak memory per hash call is bounded regardless of file size.
- Hash output is identical, verified against `git hash-object`.

## What changed

- `crates/turborepo-scm/src/hash_object.rs`: `hash_file()` now uses `gix_index::hash::hasher` + `gix_object::encode::loose_header` to build the hasher with the blob header, then streams via `BufReader` instead of `std::fs::read`.
- `crates/turborepo-scm/src/manual.rs`: `git_like_hash_file()` writes the blob header using the file size from metadata, then streams through the `sha1::Sha1` hasher in 64KB chunks instead of `read_to_end`.

## Testing

- `test_blob_hash_matches_git_hash_object` with 128KB (multi-buffer) and 64KB (exact-buffer-boundary) cases.
- `test_manual_hash_matches_git_hash_object`: the manual path previously had no test verifying hash correctness against `git hash-object`. This new test covers the same edge cases, including streaming buffer boundaries.
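A rough illustration of why 64KB and 128KB are the interesting sizes: with a 64KB buffer, they are the inputs that end exactly on a buffer boundary. The helper `chunk_lengths` is hypothetical and only reports how an in-memory input is split by a `BufReader`; it does no hashing.

```rust
use std::io::{BufRead, BufReader, Cursor};

/// Returns the size of each piece a `BufReader` with the given capacity
/// hands out when reading `len` zero bytes from memory.
fn chunk_lengths(len: usize, capacity: usize) -> Vec<usize> {
    let mut reader = BufReader::with_capacity(capacity, Cursor::new(vec![0u8; len]));
    let mut lengths = Vec::new();
    loop {
        let n = reader
            .fill_buf()
            .expect("reading from memory should not fail")
            .len();
        if n == 0 {
            break; // end of input
        }
        lengths.push(n);
        reader.consume(n);
    }
    lengths
}
```

For a 64KB capacity, a 64KB input comes out as a single full piece followed directly by end of input, and a 128KB input as two full pieces. Those are the cases where an off-by-one in the read loop would be most likely to show up.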