Fix concurrency race condition in resourcectl local state management#934
Fix concurrency race condition in resourcectl local state management#934aditya-shantanu wants to merge 4 commits into
Conversation
#### What this PR does / why we need it: This PR fixes a concurrency race condition in the local state management of the `resourcectl` CLI utility (`dev/tools/resourcectl/main.go`). Previously, multiple instances of `resourcectl get` running concurrently would read the same `~/.local/resourcectl/state.json` file and overwrite each other's additions, leading to permanent loss of tracked Boskos resources and resource leaks as their heartbeat processes ran indefinitely. This change introduces file-based locking around the read-modify-write cycle of the state file using `github.com/gofrs/flock` to guarantee atomicity and prevent data loss. #### Which issue(s) this PR is related to: Fixes https://b.corp.google.com/issues/511322627 #### Release Note ```release-note NONE ```
✅ Deploy Preview for agent-sandbox canceled.
|
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: aditya-shantanu The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughAdd file-based locking to serialize state.json access: introduce a lock-file helper and updateState(fn) that acquires a flock for guarded read-modify-write, refactor runGet/runCleanup to use the helper, add a concurrent test, and promote flock to a direct go.mod dependency. ChangesState file locking via flock
🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@dev/tools/resourcectl/main.go`:
- Around line 319-327: The file lock acquired via stateLockFilePath() and
fileLock.Lock() currently surrounds the entire cleanup kill/release loop and
blocks other state operations; change to a two-phase approach: acquire fileLock
only to read and deserialize state (use fileLock.Lock()/Unlock()), then
immediately release before performing slow external kill/release calls, and
finally re-acquire the lock to apply and persist state changes (use
fileLock.Lock()/Unlock() around the write). Locate usages of
stateLockFilePath(), fileLock, and the cleanup kill/release loop in the cleanup
routine and refactor so external calls run outside the locked section;
re-acquire the same lock only when writing back updated state to disk.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4cb68afc-be34-41f3-9982-5a955c94c191
📒 Files selected for processing (1)
dev/tools/resourcectl/main.go
There was a problem hiding this comment.
Pull request overview
This PR addresses a real concurrency issue in resourcectl’s local state management by introducing a file lock to protect the read/modify/write cycle of ~/.local/resourcectl/state.json, preventing concurrent get/cleanup invocations from overwriting each other and losing tracked resources.
Changes:
- Add a dedicated lock file (
state.lock) colocated with the state file. - Use
github.com/gofrs/flockto serialize state updates inrunGetandrunCleanup.
Address PR review feedback: - runCleanup: split into two phases so the state-file lock is no longer held across the slow kill/release calls (network I/O to Boskos). Phase 1 atomically takes ownership of the tracked resources and clears the state under the lock; phase 2 kills heartbeats and releases from Boskos without the lock; phase 3 re-acquires the lock and re-adds only the resources that failed to release, re-reading state so concurrently added resources are not clobbered. - runGet: acquire the state lock before starting the heartbeat process so a process blocked on the lock never leaves an orphaned heartbeat if it is interrupted before persisting the resource. - Add updateState helper for the locked read-modify-write cycle shared by runGet and runCleanup. - Add TestConcurrentStateUpdates exercising concurrent state updates under contention and asserting no entries are lost. - go.mod: promote github.com/gofrs/flock from indirect to a direct dependency (go mod tidy), fixing presubmit-test-autogen-up-to-date.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
dev/tools/resourcectl/main_test.go (1)
125-130: ⚡ Quick winAssert the resource identities, not just the final count.
A broken write path can still satisfy
len(state.BoskosResources) == nwhile duplicating or corrupting entries. Checking the expectedresource-%dset would make this test actually prove that no update was clobbered under contention.Proposed assertion tightening
state, err := readState() if err != nil { t.Fatalf("failed to read state file: %v", err) } if len(state.BoskosResources) != n { t.Errorf("expected %d resources after concurrent updates, but got %d", n, len(state.BoskosResources)) } + seen := make(map[string]int, len(state.BoskosResources)) + for _, resource := range state.BoskosResources { + seen[resource.Name]++ + } + for i := 0; i < n; i++ { + name := fmt.Sprintf("resource-%d", i) + if seen[name] != 1 { + t.Fatalf("expected exactly one %q entry, got %d", name, seen[name]) + } + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@dev/tools/resourcectl/main_test.go` around lines 125 - 130, The test reads state via readState() and only asserts len(state.BoskosResources) == n, which can miss duplicates or corrupted entries; update the test to assert the exact set of resource IDs (e.g., "resource-%d") is present and unique in state.BoskosResources after concurrent updates: build the expected names for i in 0..n-1, iterate state.BoskosResources to collect actual IDs, verify the sets are equal and that there are no duplicates (or that every expected "resource-%d" exists), and fail the test with a clear message if any expected ID is missing or duplicated.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@dev/tools/resourcectl/main.go`:
- Around line 302-320: The code starts the heartbeat process (cmd) then appends
a BoskosResource and relies on updateState/writeState to persist it; if
persistence fails the heartbeat is left running. After cmd.Start() and before
returning the error from updateState/writeState failures, ensure best-effort
cleanup: kill the heartbeat process group (use negative PID since
SysProcAttr.Setpgid=true), wait for the process to exit, and if SIGTERM fails
escalate to SIGKILL; log any cleanup errors. Update the runGet/updateState
caller path to perform this cleanup when updateState or writeState returns an
error so no orphaned heartbeat (referenced symbols: cmd, SysProcAttr,
cmd.Process.Pid, BoskosResource, updateState, writeState, heartbeat).
---
Nitpick comments:
In `@dev/tools/resourcectl/main_test.go`:
- Around line 125-130: The test reads state via readState() and only asserts
len(state.BoskosResources) == n, which can miss duplicates or corrupted entries;
update the test to assert the exact set of resource IDs (e.g., "resource-%d") is
present and unique in state.BoskosResources after concurrent updates: build the
expected names for i in 0..n-1, iterate state.BoskosResources to collect actual
IDs, verify the sets are equal and that there are no duplicates (or that every
expected "resource-%d" exists), and fail the test with a clear message if any
expected ID is missing or duplicated.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 71ba71fb-3f58-4a74-9059-bad39ceae340
📒 Files selected for processing (3)
dev/tools/go.moddev/tools/resourcectl/main.godev/tools/resourcectl/main_test.go
|
|
||
| select { | ||
| case <-time.After(time.Second): | ||
| if killErr := syscall.Kill(pgid, syscall.SIGKILL); killErr != nil { | ||
| log.Error(killErr, "failed to send SIGKILL to heartbeat process group", "pgid", pgid) | ||
| } | ||
| <-done | ||
| case <-done: | ||
| } |
There was a problem hiding this comment.
Since network connectivity to Boskos is likely still functional even if local disk persistence failed, suggest performing an immediate best-effort release. This frees the resource instantly for other CI jobs rather than waiting 10-15 minutes for the reaper.
// Best-effort release from Boskos so the resource isn't tied up waiting for the reaper.
br := BoskosResource{Name: resource.Name, Owner: owner}
if relErr := br.ReleaseFromBoskos(ctx); relErr != nil {
log.Error(relErr, "failed to release resource from boskos after state update failure", "name", resource.Name)
}| defer os.RemoveAll(tmpHome) | ||
|
|
||
| origHome := os.Getenv("HOME") | ||
| os.Setenv("HOME", tmpHome) |
There was a problem hiding this comment.
Use t.Setenv to manage test environment variables is more idiomatic in modern Go and automatically handles restoring the original environment variable when the test completes, even if it was originally unset.
What this PR does / why we need it:
This PR fixes a concurrency race condition in the local state management of the
resourcectlCLI utility (dev/tools/resourcectl/main.go). Previously, multiple instances ofresourcectl getrunning concurrently would read the same~/.local/resourcectl/state.jsonfile and overwrite each other's additions, leading to permanent loss of tracked Boskos resources and resource leaks as their heartbeat processes ran indefinitely. This change introduces file-based locking around the read-modify-write cycle of the state file usinggithub.1485827954.workers.dev/gofrs/flockto guarantee atomicity and prevent data loss.Which issue(s) this PR is related to:
Fixes https://b.corp.google.com/issues/511322627
Release Note
Summary by CodeRabbit
Bug Fixes
Tests