Skip to content

feat: wire in FFT worker#114

Merged
ShubyM merged 3 commits into
gke-labs:mainfrom
ShubyM:feat/fft-support
Jun 5, 2026
Merged

feat: wire in FFT worker#114
ShubyM merged 3 commits into
gke-labs:mainfrom
ShubyM:feat/fft-support

Conversation

@ShubyM
Copy link
Copy Markdown
Collaborator

@ShubyM ShubyM commented Jun 5, 2026

This PR builds on the worker split from the last PR and wires up the first usable full fine tuning mode. When OPEN_RL_WORKER_MODE=full is set, creating a model launches a dedicated FFT worker process for that job, and requests for that model are routed through Redis to that worker. LoRA keeps using the existing multi-tenant path. The worker launch logic lives in a small FFTWorkerLauncher class so the gateway owns the subprocess lifecycle in one place.

This also adds a GSM8K example so we have a concrete script for running and evaling FFT. Base line show ~1 to 5 percent accuracy and final shows ~34-35%

No snapshot integration yet.

Also while writing this I think we can start leveraging more features of FastAPI for typing the shape of the requests coming through. Seems like a good time to do it (not a high priority though)

ShubyM added 2 commits June 5, 2026 13:34
Add OPEN_RL_WORKER_MODE=full as the gateway-selected path for launching one FFT training worker process per created model.

The gateway owns the FFTWorkerLauncher lifecycle, starts a run-scoped clock_cycle process before enqueueing create_model/create_model_from_state, and shuts down launched workers with the FastAPI lifespan.

The worker loop now chooses FFTTrainingWorker in full mode, drains only its model queue through Redis, saves full checkpoints for sampler exports, and keeps public Tinker SDK metadata LoRA-shaped until the client has native full fine-tuning support.

Also moves create_model loading into the trainer workers and covers the LoRA/FFT create_model, checkpoint path, sampler-save, and optimizer behavior in tests.
@ShubyM ShubyM requested a review from droot June 5, 2026 20:41
Comment thread src/server/clock_cycle.py Outdated
from tinker_cookbook.supervised.train import Config as TrainConfig
from tinker_cookbook.supervised.train import main as train
from tinker_cookbook.supervised.types import SupervisedDatasetBuilder
from tinker_cookbook.tokenizer_utils import get_tokenizer
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So nice to see FFT working with the tinker-cookbook :)

Comment thread examples/sft/gsm8k/vllm_eval.py
param.requires_grad_(True)
self.trainable_params = trainable_model_parameters(self.model)

if ENABLE_GRADIENT_CHECKPOINTING:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is applicable to FFT as well.

return {"model_id": model_id, "is_lora": False, "base_model": base_model}
# SDK compatibility: the public client currently expects LoRA-shaped training metadata,
# even when this worker loaded full fine-tuned weights.
return {"model_id": model_id, "is_lora": True, "lora_rank": 16, "base_model": base_model}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Workers are backend processors so this code here feels like tight coupling between the API and the backend workers.

We should have an internal Result message between API <---> Worker and API should translate that appropriately to maintain compatibility with different types of clients (like tinker compatible in this case).

Also an opportunity to define typed messages between API <---> Worker because eventually they will communicate over the network.

Comment thread src/server/gateway.py Outdated
def get_default_model_name() -> str | None:
if is_single_process_mode():
if is_single_process_mode() and not is_full_worker_mode():
import clock_cycle
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

conditional import is typically a code smell. we should have explicit initialization and I thin it's related to initialization of worker on package import I saw earlier.

Comment thread src/server/gateway.py Outdated
Comment thread src/server/gateway.py Outdated
Comment thread src/server/gateway.py Outdated

env = {**os.environ, "OPEN_RL_WORKER_MODE": "full", "OPEN_RL_WORKER_MODEL_ID": model_id}
self.processes[model_id] = subprocess.Popen(
[sys.executable, "-m", "clock_cycle"],
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will have to rename clock_cycle to something else... it's a request processor actually. but not this PR.

Comment thread src/server/gateway.py Outdated
# Resolve relative names under TMP_DIR/checkpoints, leave absolute paths alone.
resolved_path = state_path if os.path.isabs(state_path) else os.path.join(TMP_DIR, "checkpoints", state_path)
model_id = str(uuid.uuid4())
if is_full_worker_mode():
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still thinking...one improvement we can make to keep the architecture decoupled. launching of the worker should be done in the request processing part instead of the API server. Let the API server enqueue requests.

Haven't thought through completely yet, so it's okay to do that as a follow-up.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed this so create_model no longer launches a worker directly from the API handler. In FFT mode the gateway writes the create-model payload to a separate worker-launch queue. WorkerLaunchProcessor drains that queue, asks FFTWorkerManager to start the per-model worker process, and only then enqueues the original request onto the normal per-model training queue.

@ShubyM ShubyM requested a review from droot June 5, 2026 23:36
@ShubyM ShubyM merged commit 5f3957e into gke-labs:main Jun 5, 2026
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants