
Conversation

@jakubno (Member) commented Nov 11, 2025

Checked that the lock works as expected (the logs were removed afterwards).
[screenshot]

Left is with locking (the cache NFS was cleaned and the orchestrator redeployed, so both caches started clean).

[screenshots]

Without the lock, the same load timed out.


Note

Introduce a file-based lock with TTL and integrate it into cache writes to prevent concurrent NFS writes; add comprehensive tests.

  • Storage / Locking:
    • New packages/shared/pkg/storage/lock: Implement file-based lock with TTL cleanup (TryAcquireLock, ReleaseLock, ErrLockAlreadyHeld); an illustrative sketch follows below.
    • Tests: Add concurrency, staleness, and path consistency tests for the lock.
  • Cache Providers:
    • CachedObjectProvider: Wrap full-file cache writes with lock acquisition; skip on ErrLockAlreadyHeld; add warnings.
    • CachedSeekableObjectProvider:
      • Lock around chunk writes in writeChunkToCache.
      • Lock around size writes in writeLocalSize.
      • Skip writes if lock already held; emit warnings on lock errors (see the second sketch below).

Written by Cursor Bugbot for commit cc765fc. This will update automatically on new commits.
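
The summary names TryAcquireLock, ReleaseLock, and ErrLockAlreadyHeld but not their exact shapes, so the following is only a minimal sketch of how a file-based lock with TTL cleanup can work: the lock is a sentinel file created atomically with O_EXCL, and a holder whose file is older than the TTL is treated as stale. The signatures and the mtime-based staleness check are assumptions for illustration, not the actual code in packages/shared/pkg/storage/lock.

```go
// Minimal sketch of a file-based lock with TTL cleanup (illustrative only).
package lock

import (
	"errors"
	"fmt"
	"os"
	"time"
)

var ErrLockAlreadyHeld = errors.New("lock already held")

// TryAcquireLock attempts to create the lock file atomically. If the file
// already exists but is older than ttl, it is treated as stale, removed, and
// the acquisition is retried once.
func TryAcquireLock(path string, ttl time.Duration) error {
	for attempt := 0; attempt < 2; attempt++ {
		f, err := os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o644)
		if err == nil {
			return f.Close()
		}
		if !os.IsExist(err) {
			return fmt.Errorf("acquiring lock %q: %w", path, err)
		}

		info, statErr := os.Stat(path)
		if statErr != nil {
			// The holder may have released the lock in the meantime; retry.
			continue
		}
		if time.Since(info.ModTime()) < ttl {
			return ErrLockAlreadyHeld
		}

		// Stale lock left behind by a crashed writer: best-effort cleanup,
		// then retry the atomic create once more.
		_ = os.Remove(path)
	}

	return ErrLockAlreadyHeld
}

// ReleaseLock removes the lock file; a missing file is not an error.
func ReleaseLock(path string) error {
	if err := os.Remove(path); err != nil && !os.IsNotExist(err) {
		return fmt.Errorf("releasing lock %q: %w", path, err)
	}
	return nil
}
```

Removing a stale lock and retrying once keeps a crashed writer from blocking the cache forever, at the cost of a small race between the staleness check and the removal; that is tolerable here because the competing writers would be writing the same bytes anyway.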

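The second sketch shows how a cache write might be wrapped with that lock, mirroring the behavior described for writeChunkToCache: skip the write when the lock is already held, warn on other lock errors. The standalone function name, the ".lock" suffix, the TTL value, and placing it in the same package as the sketch above are simplifications; the real integration lives in the cache providers.

```go
// Sketch of the cache-write integration (same package as the lock sketch
// above purely for brevity).
package lock

import (
	"errors"
	"log/slog"
	"os"
	"time"
)

// cacheLockTTL is a hypothetical TTL; the real value may differ.
const cacheLockTTL = 30 * time.Second

// writeChunkToCacheLocked writes one chunk to the NFS cache under the lock.
// Losing the race to another orchestrator is not an error: the write is
// simply skipped, since the other writer is producing the same data.
func writeChunkToCacheLocked(chunkPath string, data []byte) {
	lockPath := chunkPath + ".lock" // hypothetical lock-file naming

	if err := TryAcquireLock(lockPath, cacheLockTTL); err != nil {
		if errors.Is(err, ErrLockAlreadyHeld) {
			return // another orchestrator is already writing this chunk
		}
		slog.Warn("acquiring cache lock failed", "path", lockPath, "error", err)
		return
	}
	defer func() {
		if err := ReleaseLock(lockPath); err != nil {
			slog.Warn("releasing cache lock failed", "path", lockPath, "error", err)
		}
	}()

	if err := os.WriteFile(chunkPath, data, 0o644); err != nil {
		slog.Warn("writing chunk to cache failed", "path", chunkPath, "error", err)
	}
}
```

Skipping instead of waiting is the point of the change: at most one writer per cached file touches the NFS at a time, which prevents the concurrent-write pile-up that caused the timeouts shown above.
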
linear bot commented Nov 11, 2025

@jakubno added the improvement (Improvement for current functionality) label Nov 11, 2025
@jakubno marked this pull request as draft November 11, 2025 16:08
chatgpt-codex-connector[bot] left a comment that was later marked as resolved.

@jakubno (Member, Author) commented Nov 11, 2025

bugbot run

@ValentaTomas (Member)

I assume this PR is intended to prevent the avalanche of requests when we start multiple orchestrators?

@ValentaTomas (Member)

Was the problem with the NFS throughput or IOPS, then?

@jakubno marked this pull request as ready for review November 11, 2025 18:52
@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

@ValentaTomas (Member)

Consideration—with the way the cleanup script picks files, can it delete the lock files?

@ValentaTomas (Member)

Alternative—can we make the writes/reads to NFS timeout if they are taking too long, and just fetch from storage instead?

@ValentaTomas (Member)

Alternative—if we limit the concurrency of writing things to cache per orchestrator, can the usage reasonably flatten (until we have too many orchestrators)?

@jakubno (Member, Author) commented Nov 12, 2025

Consideration—with the way the cleanup script picks files, can it delete the lock files?

yes

@jakubno (Member, Author) commented Nov 12, 2025

Alternative—can we make the writes/reads to NFS timeout if they are taking too long, and just fetch from storage instead?

Yes, but that's a different problem; you want to prevent the requests in the first place.

@jakubno (Member, Author) commented Nov 12, 2025

Alternative—if we limit the concurrency of writing things to cache per orchestrator, can the usage reasonably flatten (until we have too many orchestrators)?

This is caused by the concurrency; from the POV of any single node the traffic is completely reasonable (and legitimate).

@jakubno merged commit 553a842 into main Nov 12, 2025
28 checks passed
@jakubno deleted the implement-locking-mechanism-for-nfs-cache-to-prevent-usage-eng-3293 branch November 12, 2025 20:51