feat: snapshot agent #109

Merged: droot merged 1 commit into gke-labs:main from ShubyM:feat/snapshot-agent on Jun 3, 2026

ShubyM (Collaborator) commented Jun 3, 2026

Introduces a snapshot agent and a client that training workers will use to hold a lock over GPU resources.

ShubyM requested a review from droot June 3, 2026 20:18

```python
start = time.perf_counter()
logger.info("restore pid=%s", pid)
self.run_cuda_checkpoint(["--action", "restore", "--pid", str(pid)])
self.run_cuda_checkpoint(["--action", "unlock", "--pid", str(pid)])
```

In our internal code we used `cuda-checkpoint --toggle` to restore, and I am not sure whether the difference matters.
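
To make the two variants in this comment concrete, here is a minimal sketch that only builds the argument lists and does not run anything. The helper name `restore_argvs` and the `use_toggle` switch are hypothetical and not part of this PR; the `--action restore`, `--action unlock`, `--toggle`, and `--pid` flags are the ones mentioned in the diff and the comment above. Whether the two paths behave identically on a given driver version should be checked against NVIDIA's `cuda-checkpoint` documentation.

```python
from typing import List

CUDA_CHECKPOINT = "cuda-checkpoint"  # assumption: the binary is on PATH


def restore_argvs(pid: int, use_toggle: bool = False) -> List[List[str]]:
    """Return the cuda-checkpoint invocations needed to restore `pid`.

    Hypothetical helper for illustration; it only builds argv lists.
    """
    if use_toggle:
        # Single call that flips the process's CUDA state.
        return [[CUDA_CHECKPOINT, "--toggle", "--pid", str(pid)]]
    # Explicit two-step sequence, as in the diff under review.
    return [
        [CUDA_CHECKPOINT, "--action", "restore", "--pid", str(pid)],
        [CUDA_CHECKPOINT, "--action", "unlock", "--pid", str(pid)],
    ]
```

One practical difference worth noting: the explicit sequence states the intended end state, while a toggle depends on the state the process is in when the command runs.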


It exposes four commands over a Unix socket:

- `REGISTER(run_id, pid)` records the process that owns a run.

What is `run_id`?
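
The excerpt does not define `run_id`, so the following is only a sketch of how a client might issue `REGISTER(run_id, pid)` over a Unix socket. It assumes that `run_id` is an opaque string identifying one training run and that commands are sent as newline-delimited JSON. Both are assumptions made for illustration, as are the names `encode_register` and `send_register` and the field names in the message; the agent's real wire format may differ.

```python
import json
import socket


def encode_register(run_id: str, pid: int) -> bytes:
    """Encode a REGISTER command as one JSON line (assumed wire format)."""
    msg = {"cmd": "REGISTER", "run_id": run_id, "pid": pid}
    return (json.dumps(msg, sort_keys=True) + "\n").encode("utf-8")


def send_register(socket_path: str, run_id: str, pid: int) -> bytes:
    """Send REGISTER to the agent at `socket_path` and return its raw reply.

    Hypothetical client helper; requires a platform with AF_UNIX sockets.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(encode_register(run_id, pid))
        return sock.recv(4096)
```

Under these assumptions, a worker would call something like `send_register(path, "run-1", os.getpid())` once at startup so the agent knows which process owns the run.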

droot merged commit d581e52 into gke-labs:main Jun 3, 2026
11 checks passed