Skip to content

Fix concurrency race condition in resourcectl local state management#934

Open
aditya-shantanu wants to merge 4 commits into
kubernetes-sigs:mainfrom
aditya-shantanu:fix-resourcectl-race-condition
Open

Fix concurrency race condition in resourcectl local state management#934
aditya-shantanu wants to merge 4 commits into
kubernetes-sigs:mainfrom
aditya-shantanu:fix-resourcectl-race-condition

Conversation

@aditya-shantanu
Copy link
Copy Markdown
Collaborator

@aditya-shantanu aditya-shantanu commented Jun 4, 2026

What this PR does / why we need it:

This PR fixes a concurrency race condition in the local state management of the resourcectl CLI utility (dev/tools/resourcectl/main.go). Previously, multiple instances of resourcectl get running concurrently would read the same ~/.local/resourcectl/state.json file and overwrite each other's additions, leading to permanent loss of tracked Boskos resources and resource leaks as their heartbeat processes ran indefinitely. This change introduces file-based locking around the read-modify-write cycle of the state file using github.com/gofrs/flock to guarantee atomicity and prevent data loss.

Which issue(s) this PR is related to:

Fixes https://b.corp.google.com/issues/511322627

Release Note

NONE

Summary by CodeRabbit

  • Bug Fixes

    • Serialized CLI state access with file locking to prevent corruption during concurrent runs.
    • Background heartbeat is only started after state is persisted to avoid orphaned processes and ensure recorded PIDs are durable.
    • Cleanup now safely handles concurrent updates and re-appends resources that fail to release without overwriting new entries.
  • Tests

    • Added a concurrency test validating safe simultaneous state updates and deduplication.

#### What this PR does / why we need it:
This PR fixes a concurrency race condition in the local state management of the `resourcectl` CLI utility (`dev/tools/resourcectl/main.go`). Previously, multiple instances of `resourcectl get` running concurrently would read the same `~/.local/resourcectl/state.json` file and overwrite each other's additions, leading to permanent loss of tracked Boskos resources and resource leaks as their heartbeat processes ran indefinitely. This change introduces file-based locking around the read-modify-write cycle of the state file using `github.com/gofrs/flock` to guarantee atomicity and prevent data loss.

#### Which issue(s) this PR is related to:
Fixes https://b.corp.google.com/issues/511322627

#### Release Note
```release-note
NONE
```
@netlify
Copy link
Copy Markdown

netlify Bot commented Jun 4, 2026

Deploy Preview for agent-sandbox canceled.

Name Link
🔨 Latest commit 4183f1b
🔍 Latest deploy log https://app.netlify.com/projects/agent-sandbox/deploys/6a21bb11a2a847000890438b

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aditya-shantanu
Once this PR has been reviewed and has the lgtm label, please assign igooch for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from janetkuo and justinsb June 4, 2026 15:51
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 4, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 4, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 8a21c2ae-27f8-449a-bfbd-6bfdaac8cf62

📥 Commits

Reviewing files that changed from the base of the PR and between 56fc47d and 4183f1b.

📒 Files selected for processing (1)
  • dev/tools/resourcectl/main.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • dev/tools/resourcectl/main.go

📝 Walkthrough

Walkthrough

Add file-based locking to serialize state.json access: introduce a lock-file helper and updateState(fn) that acquires a flock for guarded read-modify-write, refactor runGet/runCleanup to use the helper, add a concurrent test, and promote flock to a direct go.mod dependency.

Changes

State file locking via flock

Layer / File(s) Summary
Module dependency bump
dev/tools/go.mod
Promote github.com/gofrs/flock to a direct requirement in the tools module.
Lock infrastructure setup
dev/tools/resourcectl/main.go
Imported github.com/gofrs/flock and added stateLockFilePath() to derive state.lock from the state directory.
Guarded state update helper
dev/tools/resourcectl/main.go
Added updateState(fn) which acquires a flock, runs a read-modify-write closure, writes the updated state, and releases the lock.
runGet: persist acquired resource under lock
dev/tools/resourcectl/main.go
Refactored runGet to append new BoskosResources via updateState, starting heartbeats while the lock is held to avoid orphaned heartbeats.
runCleanup: snapshot and clear under lock
dev/tools/resourcectl/main.go
Refactored runCleanup to snapshot and clear tracked resources under lock, then release the lock before performing kill/release operations.
runCleanup: re-persist remaining failures
dev/tools/resourcectl/main.go
After release attempts, re-acquire the lock only to append any remaining resources back into state without clobbering concurrent updates.
Concurrency test for state updates
dev/tools/resourcectl/main_test.go
Add TestConcurrentStateUpdates to run multiple goroutines calling updateState concurrently and assert all appended entries persist.

🎯 4 (Complex) | ⏱️ ~45 minutes

"I'm a rabbit with a tiny key,
I guard the file so states agree.
Locks in place, no more distress,
Concurrent writes now pass the test. 🐇🔐"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 72.73% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: fixing a concurrency race condition in resourcectl's local state management, which is the primary objective of the PR.
Description check ✅ Passed The PR description includes all required template sections: a clear explanation of what the PR does and why it's needed, a reference to the related issue with 'Fixes' to auto-close it, and a release note section.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 4, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@dev/tools/resourcectl/main.go`:
- Around line 319-327: The file lock acquired via stateLockFilePath() and
fileLock.Lock() currently surrounds the entire cleanup kill/release loop and
blocks other state operations; change to a two-phase approach: acquire fileLock
only to read and deserialize state (use fileLock.Lock()/Unlock()), then
immediately release before performing slow external kill/release calls, and
finally re-acquire the lock to apply and persist state changes (use
fileLock.Lock()/Unlock() around the write). Locate usages of
stateLockFilePath(), fileLock, and the cleanup kill/release loop in the cleanup
routine and refactor so external calls run outside the locked section;
re-acquire the same lock only when writing back updated state to disk.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4cb68afc-be34-41f3-9982-5a955c94c191

📥 Commits

Reviewing files that changed from the base of the PR and between 8cff34d and f7a3ab0.

📒 Files selected for processing (1)
  • dev/tools/resourcectl/main.go

Comment thread dev/tools/resourcectl/main.go Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a real concurrency issue in resourcectl’s local state management by introducing a file lock to protect the read/modify/write cycle of ~/.local/resourcectl/state.json, preventing concurrent get/cleanup invocations from overwriting each other and losing tracked resources.

Changes:

  • Add a dedicated lock file (state.lock) colocated with the state file.
  • Use github.com/gofrs/flock to serialize state updates in runGet and runCleanup.

Comment thread dev/tools/resourcectl/main.go Outdated
Comment thread dev/tools/resourcectl/main.go Outdated
Comment thread dev/tools/resourcectl/main.go Outdated
Address PR review feedback:

- runCleanup: split into two phases so the state-file lock is no longer
  held across the slow kill/release calls (network I/O to Boskos).
  Phase 1 atomically takes ownership of the tracked resources and clears
  the state under the lock; phase 2 kills heartbeats and releases from
  Boskos without the lock; phase 3 re-acquires the lock and re-adds only
  the resources that failed to release, re-reading state so concurrently
  added resources are not clobbered.

- runGet: acquire the state lock before starting the heartbeat process so
  a process blocked on the lock never leaves an orphaned heartbeat if it
  is interrupted before persisting the resource.

- Add updateState helper for the locked read-modify-write cycle shared by
  runGet and runCleanup.

- Add TestConcurrentStateUpdates exercising concurrent state updates under
  contention and asserting no entries are lost.

- go.mod: promote github.com/gofrs/flock from indirect to a direct
  dependency (go mod tidy), fixing presubmit-test-autogen-up-to-date.
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 4, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
dev/tools/resourcectl/main_test.go (1)

125-130: ⚡ Quick win

Assert the resource identities, not just the final count.

A broken write path can still satisfy len(state.BoskosResources) == n while duplicating or corrupting entries. Checking the expected resource-%d set would make this test actually prove that no update was clobbered under contention.

Proposed assertion tightening
 	state, err := readState()
 	if err != nil {
 		t.Fatalf("failed to read state file: %v", err)
 	}
 	if len(state.BoskosResources) != n {
 		t.Errorf("expected %d resources after concurrent updates, but got %d", n, len(state.BoskosResources))
 	}
+	seen := make(map[string]int, len(state.BoskosResources))
+	for _, resource := range state.BoskosResources {
+		seen[resource.Name]++
+	}
+	for i := 0; i < n; i++ {
+		name := fmt.Sprintf("resource-%d", i)
+		if seen[name] != 1 {
+			t.Fatalf("expected exactly one %q entry, got %d", name, seen[name])
+		}
+	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dev/tools/resourcectl/main_test.go` around lines 125 - 130, The test reads
state via readState() and only asserts len(state.BoskosResources) == n, which
can miss duplicates or corrupted entries; update the test to assert the exact
set of resource IDs (e.g., "resource-%d") is present and unique in
state.BoskosResources after concurrent updates: build the expected names for i
in 0..n-1, iterate state.BoskosResources to collect actual IDs, verify the sets
are equal and that there are no duplicates (or that every expected "resource-%d"
exists), and fail the test with a clear message if any expected ID is missing or
duplicated.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@dev/tools/resourcectl/main.go`:
- Around line 302-320: The code starts the heartbeat process (cmd) then appends
a BoskosResource and relies on updateState/writeState to persist it; if
persistence fails the heartbeat is left running. After cmd.Start() and before
returning the error from updateState/writeState failures, ensure best-effort
cleanup: kill the heartbeat process group (use negative PID since
SysProcAttr.Setpgid=true), wait for the process to exit, and if SIGTERM fails
escalate to SIGKILL; log any cleanup errors. Update the runGet/updateState
caller path to perform this cleanup when updateState or writeState returns an
error so no orphaned heartbeat (referenced symbols: cmd, SysProcAttr,
cmd.Process.Pid, BoskosResource, updateState, writeState, heartbeat).

---

Nitpick comments:
In `@dev/tools/resourcectl/main_test.go`:
- Around line 125-130: The test reads state via readState() and only asserts
len(state.BoskosResources) == n, which can miss duplicates or corrupted entries;
update the test to assert the exact set of resource IDs (e.g., "resource-%d") is
present and unique in state.BoskosResources after concurrent updates: build the
expected names for i in 0..n-1, iterate state.BoskosResources to collect actual
IDs, verify the sets are equal and that there are no duplicates (or that every
expected "resource-%d" exists), and fail the test with a clear message if any
expected ID is missing or duplicated.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 71ba71fb-3f58-4a74-9059-bad39ceae340

📥 Commits

Reviewing files that changed from the base of the PR and between f7a3ab0 and 2cbac4f.

📒 Files selected for processing (3)
  • dev/tools/go.mod
  • dev/tools/resourcectl/main.go
  • dev/tools/resourcectl/main_test.go

Comment thread dev/tools/resourcectl/main.go
Comment on lines +332 to +340

select {
case <-time.After(time.Second):
if killErr := syscall.Kill(pgid, syscall.SIGKILL); killErr != nil {
log.Error(killErr, "failed to send SIGKILL to heartbeat process group", "pgid", pgid)
}
<-done
case <-done:
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since network connectivity to Boskos is likely still functional even if local disk persistence failed, suggest performing an immediate best-effort release. This frees the resource instantly for other CI jobs rather than waiting 10-15 minutes for the reaper.

// Best-effort release from Boskos so the resource isn't tied up waiting for the reaper.
br := BoskosResource{Name: resource.Name, Owner: owner}
if relErr := br.ReleaseFromBoskos(ctx); relErr != nil {
	log.Error(relErr, "failed to release resource from boskos after state update failure", "name", resource.Name)
}

defer os.RemoveAll(tmpHome)

origHome := os.Getenv("HOME")
os.Setenv("HOME", tmpHome)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use t.Setenv to manage test environment variables is more idiomatic in modern Go and automatically handles restoring the original environment variable when the test completes, even if it was originally unset.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ready-for-review size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

4 participants