perf: Stream file contents during hashing to lower memory usage #12059

Merged: anthonyshew merged 2 commits into main from anthonyshew/streaming-file-hash on Feb 28, 2026

@anthonyshew anthonyshew commented Feb 28, 2026

Summary

  • Both hash_file (gix path) and git_like_hash_file (manual fallback) previously called std::fs::read() / read_to_end(), loading entire files into memory before hashing. When rayon parallelizes hashing across many large files, this can trigger out-of-memory failures in memory-constrained environments.
  • Now both paths stat the file for its size, write the git blob header into the hasher, then stream through a 64KB BufReader. Peak memory per hash call is bounded regardless of file size.
  • Hash output is identical — verified by tests comparing against git hash-object.
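
The stat, header, then stream sequence described above can be sketched with only the standard library. This is a minimal illustration, not the code from this PR: the `ByteHasher` trait and the `Fnv1a` type are hypothetical stand-ins for the real SHA-1 hashers (gix's hasher and `sha1::Sha1`), chosen so the sketch has no dependencies, and `hash_blob_streaming` is an assumed name.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};
use std::path::Path;

/// Hypothetical stand-in for a SHA-1 hasher; not the gix or sha1 crate API.
trait ByteHasher {
    fn update(&mut self, bytes: &[u8]);
    fn finish(&self) -> u64;
}

/// FNV-1a (64-bit). Used only to keep the sketch dependency-free.
/// Git blob hashing uses SHA-1, not FNV.
struct Fnv1a(u64);

impl Fnv1a {
    fn new() -> Self {
        Fnv1a(0xcbf29ce484222325)
    }
}

impl ByteHasher for Fnv1a {
    fn update(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x100000001b3);
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

/// Buffer size matching the 64KB figure in the summary.
const BUF_SIZE: usize = 64 * 1024;

/// Feeds `blob <size>\0` followed by the file contents into `hasher`,
/// holding at most one `BUF_SIZE` buffer in memory at a time.
fn hash_blob_streaming<H: ByteHasher>(path: &Path, hasher: &mut H) -> io::Result<()> {
    let file = File::open(path)?;
    // The blob header needs the size up front, so stat before reading.
    let size = file.metadata()?.len();
    hasher.update(format!("blob {}\0", size).as_bytes());

    let mut reader = BufReader::with_capacity(BUF_SIZE, file);
    loop {
        // `fill_buf` exposes the reader's internal buffer, so no second copy is needed.
        let chunk = reader.fill_buf()?;
        if chunk.is_empty() {
            break; // EOF
        }
        let n = chunk.len();
        hasher.update(chunk);
        reader.consume(n);
    }
    Ok(())
}
```

Because the hasher sees the same byte sequence either way (header, then contents), the streamed digest should match a one-shot digest over the header plus the whole file, which is the property the tests in this PR check against git hash-object.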

What changed

crates/turborepo-scm/src/hash_object.rs: hash_file() now uses gix_index::hash::hasher + gix_object::encode::loose_header to build the hasher with the blob header, then streams via BufReader instead of std::fs::read.

crates/turborepo-scm/src/manual.rs: git_like_hash_file() writes the blob header using the file size from metadata, then streams through the sha1::Sha1 hasher in 64KB chunks instead of read_to_end.

Testing

  • Extended test_blob_hash_matches_git_hash_object with 128KB (multi-buffer) and 64KB (exact-buffer-boundary) cases.
  • Added test_manual_hash_matches_git_hash_object — the manual path previously had no test verifying hash correctness against git hash-object. This new test covers the same edge cases including streaming buffer boundaries.
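
To see why 64KB and 128KB are the interesting sizes, the following sketch records the chunk lengths a 64KB BufReader hands out for an in-memory input. The helper names (`chunk_lengths`, `chunk_lengths_for_size`) are hypothetical and are not part of the test suite in this PR.

```rust
use std::io::{BufRead, BufReader, Cursor, Read};

/// Buffer capacity matching the 64KB figure from the PR description.
const BUF_SIZE: usize = 64 * 1024;

/// Returns the length of every chunk a `BufReader` of `BUF_SIZE` yields
/// while draining `reader`. Hypothetical helper, for illustration only.
fn chunk_lengths<R: Read>(reader: R) -> std::io::Result<Vec<usize>> {
    let mut buffered = BufReader::with_capacity(BUF_SIZE, reader);
    let mut lengths = Vec::new();
    loop {
        let n = buffered.fill_buf()?.len();
        if n == 0 {
            break; // EOF: an empty buffer after a refill attempt
        }
        lengths.push(n);
        buffered.consume(n);
    }
    Ok(lengths)
}

/// Chunk lengths for an in-memory "file" of `size` zero bytes.
fn chunk_lengths_for_size(size: usize) -> Vec<usize> {
    chunk_lengths(Cursor::new(vec![0u8; size])).expect("in-memory reads cannot fail")
}
```

With an in-memory Cursor, a 64KB input yields exactly one full chunk followed by EOF, a 128KB input yields two full chunks, and 64KB + 1 bytes yields a full chunk plus a 1-byte tail. Reads from a real file may return shorter chunks, so these exact lengths are a property of the in-memory case only; the streamed hash is unaffected by where the chunk boundaries fall.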

Both hash_file (gix path) and git_like_hash_file (manual fallback) previously
read entire files into memory before hashing. For large files hashed in
parallel on rayon, this could cause OOM on memory-constrained CI runners.

Now both paths stat the file for its size, write the git blob header, then
stream through a 64KB BufReader. Peak memory per hash call is bounded
regardless of file size. Hash output is identical — verified against
git hash-object.
@anthonyshew anthonyshew requested a review from a team as a code owner February 28, 2026 04:41
@anthonyshew anthonyshew requested review from tknickman and removed request for a team February 28, 2026 04:41

- Propagate metadata error with ? instead of silently falling back to
  size 0, which would produce an incorrect blob hash header.
- Use std::fs::write in test to correctly write binary content instead
  of silently writing empty files via str::from_utf8 fallback.
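
The first fix can be illustrated with a small sketch. Both function names are hypothetical, and the "lossy" variant is only a guess at the shape of the earlier fallback, not the code that was reviewed.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Assumed shape of the problematic pattern: a failed stat silently becomes
/// size 0, so the header claims an empty blob and the resulting hash is wrong.
fn blob_header_lossy(path: &Path) -> Vec<u8> {
    let size = fs::metadata(path).map(|m| m.len()).unwrap_or(0);
    format!("blob {}\0", size).into_bytes()
}

/// Propagating the error with `?` lets the caller see that the stat failed
/// instead of receiving a header for a file that could not be inspected.
fn blob_header(path: &Path) -> io::Result<Vec<u8>> {
    let size = fs::metadata(path)?.len();
    Ok(format!("blob {}\0", size).into_bytes())
}
```

For a path that cannot be stat'ed, the lossy variant returns the header of an empty blob while the `?` variant returns an `Err`. On the test side, `std::fs::write(path, bytes)` accepts any `AsRef<[u8]>`, so binary fixtures can be written directly without a UTF-8 conversion step.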
@github-actions

Coverage Report

Metric Coverage
Lines 81.39%
Functions 53.22%
Branches 0.00%


@anthonyshew anthonyshew changed the title from "perf: Stream file contents during hashing to prevent OOM on large repos" to "perf: Stream file contents during hashing to lower memory usage" Feb 28, 2026
@anthonyshew anthonyshew merged commit f03cdce into main Feb 28, 2026
73 checks passed
@anthonyshew anthonyshew deleted the anthonyshew/streaming-file-hash branch February 28, 2026 05:00
github-actions Bot added a commit that referenced this pull request Feb 28, 2026
## Release v2.8.13-canary.8

Versioned docs: https://v2-8-13-canary-8.turborepo.dev

### Changes

- fix: Exclude peer dependencies from workspace external dep resolution
(#12050) (`3a75547`)
- test: Port all 15 workspace-configs prysk tests to Rust (#12058)
(`55442be`)
- release(turborepo): 2.8.13-canary.7 (#12060) (`495afdc`)
- perf: Stream file contents during hashing to lower memory usage
(#12059) (`f03cdce`)
- fix: Treat `npm: alias` dependencies as external, not workspace
references (#12061) (`b179cb8`)
- test: Port 18 more prysk tests to Rust (other/ +
lockfile-aware-caching/) (#12062) (`7887af2`)

---------

Co-authored-by: Turbobot <turbobot@vercel.com>