Copy-paste flows for RunPod (submission-style) and Modal (remote GPU without managing pods).
GCP / Azure: same ideas (VM + data disk + bundle results); sections reserved below.
To open a second checkout without switching branches:
git worktree add ../parameter-golf-second main

First time with a network volume? Seed /workspace once (clone + FineWeb), then use the steps below on every pod.
From repo root on the pod (e.g. cd /workspace/parameter-golf after git pull):

1. Preflight (Python/torch/CUDA, GPU count, dataset + tokenizer):
   ./scripts/runpod_preflight.sh 8
   Use 1 instead of 8 on a 1×H100 pod.
2. Train (sets RUN_ID, DATA_PATH, TOKENIZER_PATH, VOCAB_SIZE, runs torchrun):
   ./scripts/runpod_train.sh 8x --run-id prod_8xh100_$(date -u +%Y%m%d_%H%MZ)
   Smoke on 1×H100:
   ./scripts/runpod_train.sh 1x --run-id smoke_001
   Record script: add
   --train-gpt records/track_10min_16mb/<your_record>/train_gpt.py
3. Finish (meta + metrics.json + tarball for runpodctl):
   ./scripts/runpod_finish.sh "<same RUN_ID as step 2>"
Nothing needs to be “built” locally for RunPod: the Parameter Golf template image already includes PyTorch and deps. These scripts only validate paths and standardize env + post-run packaging.
| Item | Location |
|---|---|
| Training log | logs/<RUN_ID>.txt |
| Quantized artifact | final_model.int8.ptz (and optional final_model.pt) |
| Repro metadata | logs/<RUN_ID>.meta.txt (optional but recommended) |
| Parsed metrics JSON | logs/<RUN_ID>.metrics.json (optional; from script) |
Use a unique RUN_ID per prod run (e.g. prod_8xh100_20260320).
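If you launch runs from Python rather than the shell, a unique RUN_ID can be built the same way the examples above do with date -u. This is a minimal sketch; make_run_id is a hypothetical helper, not part of this repo.

```python
from datetime import datetime, timezone
from typing import Optional


def make_run_id(prefix: str = "prod_8xh100", now: Optional[datetime] = None) -> str:
    """Return '<prefix>_<UTC timestamp>', mirroring `date -u +%Y%m%d_%H%MZ`.

    `now` exists only to make the helper testable; pass a timezone-aware
    datetime. When omitted, the current UTC time is used.
    """
    moment = now if now is not None else datetime.now(timezone.utc)
    stamp = moment.astimezone(timezone.utc).strftime("%Y%m%d_%H%MZ")
    return f"{prefix}_{stamp}"
```

Minute resolution is enough as long as you do not start two prod runs with the same prefix within the same minute.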
Official template: Parameter Golf on RunPod Hub — image runpod/parameter-golf:latest (Python, PyTorch, deps pre-installed).
You create and attach volumes in the RunPod console (we cannot do that from this repo).
A network volume is separate billable storage (~$0.07/GB/mo for the first 1 TB) that persists when pods are deleted. When attached to a Pod, it replaces the normal volume disk and is mounted at /workspace. That way you pay once (time + egress) to download FineWeb, then later pods mostly run training only.
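To size the volume against budget, the quoted rate gives a quick estimate. The sketch below assumes the approximate $0.07/GB/mo figure above and treats 1 TB as 1000 GB; it deliberately refuses sizes above that, because the rate beyond the first 1 TB is not stated here. Check RunPod's pricing page for current numbers.

```python
def monthly_volume_cost_usd(size_gb: float, rate_per_gb: float = 0.07) -> float:
    """Estimate monthly storage cost for a network volume of up to 1 TB.

    rate_per_gb defaults to the approximate figure quoted in this README.
    """
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    if size_gb > 1000:
        # Assumption: 1 TB == 1000 GB; the rate above that is not covered here.
        raise ValueError("rate is only quoted for the first 1 TB")
    return round(size_gb * rate_per_gb, 2)


# A 100 GB volume comes to roughly 7 USD per month at the quoted rate.
estimate = monthly_volume_cost_usd(100)
```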
Constraints from RunPod (network volumes, storage options):
- Secure Cloud pods only (community cloud may not offer network volumes).
- Attach the volume only when deploying the Pod — not after the fact (no hot-attach).
- Pick a datacenter for the volume; GPU SKUs you can choose may depend on that region.
- Open RunPod Storage (or New Network Volume from the product UI).
- Click New Network Volume / Create Network Volume.
- Choose a datacenter (note it — use the same region when picking GPUs).
- Set a name (e.g. parameter-golf-data) and size (e.g. 100 GB+ if you want full sp1024 shards + repo + checkpoints; size can increase later, not shrink).
- Create the volume.
Optional API (replace RUNPOD_API_KEY and pick a valid dataCenterId from their API/docs):
curl --request POST \
--url https://rest.runpod.io/v1/networkvolumes \
--header 'Authorization: Bearer RUNPOD_API_KEY' \
--header 'Content-Type: application/json' \
--data '{"name":"parameter-golf-data","size":100,"dataCenterId":"US-KS-2"}'

See POST /networkvolumes.
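The same request can be built from Python with only the standard library. This sketch mirrors the curl call above (same URL, headers, and JSON body); the function names are hypothetical, and the response format is not assumed, so the sender just returns the raw body.

```python
import json
import urllib.request

NETWORK_VOLUMES_URL = "https://rest.runpod.io/v1/networkvolumes"


def build_create_volume_request(
    api_key: str,
    name: str = "parameter-golf-data",
    size_gb: int = 100,
    data_center_id: str = "US-KS-2",
) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {"name": name, "size": size_gb, "dataCenterId": data_center_id}
    return urllib.request.Request(
        NETWORK_VOLUMES_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def create_volume(api_key: str) -> str:
    """Send the request and return the raw response body as text."""
    request = build_create_volume_request(api_key)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Call create_volume(os.environ["RUNPOD_API_KEY"]) only when you actually want to create (and be billed for) a volume; the builder alone is safe to run.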
- Go to Deploy a Pod.
- Under storage, choose Network volume and select the volume you created (this mounts it as /workspace).
- Select GPU (must be available in a datacenter compatible with the volume).
- Use template Parameter Golf / image runpod/parameter-golf:latest as usual.
- Deploy.
On the pod (SSH or web terminal), /workspace is your persistent disk. Populate it once:
Option A (curl installer, no prior clone):
curl -fsSL "https://raw.githubusercontent.com/machdragon/parameter-golf/main/scripts/runpod_seed_workspace.sh" | bash

Use a fork if needed:
export GIT_REPO_URL=https://github.com/YOUR_USER/parameter-golf.git
curl -fsSL "https://raw.githubusercontent.com/machdragon/parameter-golf/main/scripts/runpod_seed_workspace.sh" | bash

Small subset while testing (saves download time):
export TRAIN_SHARDS=1
curl -fsSL "https://raw.githubusercontent.com/machdragon/parameter-golf/main/scripts/runpod_seed_workspace.sh" | bash

Option B (manual):
cd /workspace
git clone https://github.com/machdragon/parameter-golf.git
cd parameter-golf
python3 data/cached_challenge_fineweb.py --variant sp1024 # add --train-shards N if desired

Deploy a new Pod, attach the same network volume again. Then:
cd /workspace/parameter-golf
git pull
./scripts/runpod_preflight.sh 8
./scripts/runpod_train.sh 8x --run-id prod_8xh100_$(date -u +%Y%m%d_%H%MZ)

No re-download if data/ is already on the volume.
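To confirm that a freshly attached volume is already seeded before paying for GPU time, a quick check along these lines can help. It is a rough sketch, not what runpod_preflight.sh actually does; missing_data is a hypothetical helper, and the paths are the sp1024 locations used elsewhere in this README.

```python
from pathlib import Path
from typing import List

DATASET_DIR = Path("data/datasets/fineweb10B_sp1024")
TOKENIZER_FILE = Path("data/tokenizers/fineweb_1024_bpe.model")


def missing_data(repo_root: str) -> List[str]:
    """Return human-readable problems; an empty list means the data looks seeded."""
    root = Path(repo_root)
    problems = []
    dataset = root / DATASET_DIR
    # The shard file naming is not assumed; any entry in the directory counts.
    if not dataset.is_dir() or not any(dataset.iterdir()):
        problems.append(f"no dataset shards under {dataset}")
    tokenizer = root / TOKENIZER_FILE
    if not tokenizer.is_file():
        problems.append(f"tokenizer missing: {tokenizer}")
    return problems
```

For example, missing_data("/workspace/parameter-golf") returning an empty list suggests you can go straight to preflight and training.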
Optional: S3-compatible API can upload data without a running GPU pod to reduce wasted GPU time (advanced).
- Network volume = persistent /workspace; see above.
- Smaller data while debugging: python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards N or TRAIN_SHARDS=N with runpod_seed_workspace.sh.
If you prefer explicit exports:
1× H100 smoke
RUN_ID=smoke_1xh100 \
DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
torchrun --standalone --nproc_per_node=1 train_gpt.py

8× H100 (submission wallclock)
RUN_ID=prod_8xh100 \
DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
VOCAB_SIZE=1024 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

Training writes under the current working directory (final_model.*, logs/<RUN_ID>.txt).
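The explicit exports above translate directly into a Python launcher if you want to drive runs from a script. This is a sketch under the same paths and values as the shell examples; torchrun_invocation and launch are hypothetical names, and launch assumes torchrun is on PATH.

```python
import os
import subprocess
from typing import Dict, List, Tuple


def torchrun_invocation(run_id: str, nproc: int) -> Tuple[List[str], Dict[str, str]]:
    """Return the torchrun command and environment matching the exports above."""
    env = dict(
        os.environ,
        RUN_ID=run_id,
        DATA_PATH="./data/datasets/fineweb10B_sp1024/",
        TOKENIZER_PATH="./data/tokenizers/fineweb_1024_bpe.model",
        VOCAB_SIZE="1024",
    )
    cmd = ["torchrun", "--standalone", f"--nproc_per_node={nproc}", "train_gpt.py"]
    return cmd, env


def launch(run_id: str, nproc: int, repo_root: str = ".") -> subprocess.CompletedProcess:
    """Run training from repo_root; outputs land under that directory."""
    cmd, env = torchrun_invocation(run_id, nproc)
    return subprocess.run(cmd, env=env, cwd=repo_root, check=True)
```

Because training writes relative to the working directory, repo_root decides where final_model.* and logs/ end up.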
One shot: ./scripts/runpod_finish.sh "$RUN_ID" (runs the steps below).
Piecemeal:
- Optional: record environment metadata (still on the pod, repo root):
  export RUN_ID=prod_8xh100
  ./scripts/make_run_meta.sh
- Optional: JSON summary from the log:
  python3 scripts/extract_run_metrics.py "logs/${RUN_ID}.txt" -o "logs/${RUN_ID}.metrics.json"
- Bundle everything into one file to transfer:
  ./scripts/bundle_run_results.sh "$RUN_ID" "logs/${RUN_ID}_bundle.tar.gz"
- Transfer (many pods use SSH that does not expose SCP/SFTP, so use RunPod's tool):
  On the pod:
  cd /workspace/parameter-golf # or your path
  runpodctl send "logs/${RUN_ID}_bundle.tar.gz"
  On your laptop:
  runpodctl receive <ONE_TIME_CODE>
  Docs: Transfer files, SSH setup.
- Alternative: enable full SSH (public IP + port) and use scp/rsync if you prefer.
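For a sense of what the bundling step amounts to, here is a rough Python equivalent. It is not the implementation of bundle_run_results.sh: the file list follows the artifact table earlier in this README, and the archive layout (everything under a top-level <RUN_ID>/ folder) is an assumption made for illustration.

```python
import tarfile
from pathlib import Path
from typing import List


def bundle_run(run_id: str, out_path: str, run_dir: str = ".") -> List[str]:
    """Pack the run's known artifacts into a .tar.gz; return the files included."""
    base = Path(run_dir)
    candidates = [
        Path("logs") / f"{run_id}.txt",           # training log
        Path("logs") / f"{run_id}.meta.txt",      # optional repro metadata
        Path("logs") / f"{run_id}.metrics.json",  # optional parsed metrics
        Path("final_model.int8.ptz"),             # quantized artifact
    ]
    included = []
    with tarfile.open(out_path, "w:gz") as archive:
        for relative in candidates:
            source = base / relative
            if source.is_file():  # optional files are skipped when absent
                archive.add(source, arcname=(Path(run_id) / relative).as_posix())
                included.append(relative.as_posix())
    return included
```

Checking the returned list before transfer is a cheap way to notice that the training log or model artifact never got written.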
python3 -m venv .venv-modal
.venv-modal/bin/pip install -U pip modal
.venv-modal/bin/modal setup

Modal never reads ./data/ from your laptop. Training looks only at the named volume mounted as /vol inside the container (DATA_PATH=/vol/datasets/..., TOKENIZER_PATH=/vol/tokenizers/...).
- Per Modal account: run ./scripts/modal_sync_data.sh once (or again after switching Modal logins / empty volume). It uploads:
  - local ./data/datasets/fineweb10B_sp1024/ → volume path datasets/fineweb10B_sp1024/
  - local ./data/tokenizers/fineweb_1024_bpe.model → tokenizers/fineweb_1024_bpe.model
- Confirm:
  ./scripts/modal_sync_data.sh
  .venv-modal/bin/modal volume ls parameter-golf-data
  .venv-modal/bin/modal volume ls parameter-golf-data tokenizers

If fineweb_1024_bpe.model is missing on the volume, you will get Not found: "/vol/tokenizers/..." during training until you sync.
If modal volume put errors with already exists on dataset shards, a previous sync already uploaded them; re-run ./scripts/modal_sync_data.sh — it continues and always re-uploads the tokenizer with --force. To overwrite all shards: MODAL_SYNC_FORCE=1 ./scripts/modal_sync_data.sh.
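The three path spaces involved (laptop, volume, container) are easy to mix up when reading error messages. The sketch below just restates the mapping described above; SYNC_MAP and container_path are illustrative names, and the container paths are derived by prefixing the /vol mount, which is an assumption about how the training scripts build DATA_PATH and TOKENIZER_PATH.

```python
from pathlib import PurePosixPath

VOLUME_MOUNT = PurePosixPath("/vol")

# local path on your laptop -> path inside the parameter-golf-data volume
SYNC_MAP = {
    "./data/datasets/fineweb10B_sp1024/": "datasets/fineweb10B_sp1024/",
    "./data/tokenizers/fineweb_1024_bpe.model": "tokenizers/fineweb_1024_bpe.model",
}


def container_path(volume_path: str) -> str:
    """Translate a volume-relative path into the path seen inside the container."""
    # Strip any leading slash so the join cannot escape the mount point.
    return str(VOLUME_MOUNT / volume_path.lstrip("/"))


# Where each synced item should appear during training.
container_paths = {local: container_path(remote) for local, remote in SYNC_MAP.items()}
```

A Not found error that mentions a /vol/... path therefore points at the volume contents, not at anything under ./data/ on your machine.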
.venv-modal/bin/modal run scripts/modal_train_h100.py --run-id my-modal-1x

Faster image build than the LAWA/FA3 script. This script uses Modal's uv_pip_install on the PyTorch registry image and the same /vol dataset + tokenizer paths as the FA3 scripts:
.venv-modal/bin/modal run scripts/modal_train_8x_h100.py --run-id my-modal-8x

scripts/modal_train_lawa.py and scripts/modal_train_kure_r2_ttt.py install prebuilt flash_attn_3 wheels (windreamer index), with no git clone / Hopper compile during image build, so new Modal workspaces rebuild in seconds.
Base image and wheel index live in scripts/modal_image_fa3_pytorch.py (PYTORCH_FA3_BASE, FA3_WHEEL_FIND_LINKS). Installs use Modal’s documented uv_pip_install path for registry images (Modal Image reference) so these scripts avoid the PEP 668 externally-managed-environment failure that raw pip_install can hit on PyTorch Hub images. If the FA3 wheel cannot be found, change base + find-links together to a matching row on the windreamer page.
Edit LOCAL_TRAIN_GPT at the top of the script to your record path, e.g.
records/track_10min_16mb/<your_record>/train_gpt.py.
.venv-modal/bin/modal run scripts/modal_train_lawa.py --run-id lawa-test-001
.venv-modal/bin/modal run scripts/modal_train_kure_r2_ttt.py --run-id kure-r2-ttt-001

Validate FA3 image + volume (1× H100 briefly, ~1 min after image build):
Checks torch, flash_attn_interface, and that parameter-golf-data has datasets/fineweb10B_sp1024 + tokenizers/fineweb_1024_bpe.model (loads SentencePiece). Run ./scripts/modal_sync_data.sh first on the same Modal account.
Wait until any in-flight Modal training run has finished before running this, so you do not grab another H100 or split attention while the main job completes.
.venv-modal/bin/modal run scripts/modal_fa3_image_smoke.py

A successful smoke run ends with [modal] FA3 image + volume checks OK. If you see No module named 'modal_image_fa3_pytorch', you are running an older checkout; pull the latest branch so Modal mounts both helper modules into the container.
Cross-account reuse (Docker-style): Modal does not export internal im-… image tarballs; cache is per workspace. To share one environment everywhere, build scripts/Dockerfile.modal-fa3, push to GHCR (or Docker Hub), then replace the image in the script with modal.Image.from_registry("ghcr.io/<you>/parameter-golf-modal:<tag>") and only add_local_file for code changes.
Training uses cwd=/vol/runs/<run_id>, so logs and final_model.* persist under volume path runs/<run_id>/.
Commit is performed automatically at the end of each training function.
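The per-run working directory pattern can be pictured as follows. This is a simplified, Modal-free sketch and not the code in the training scripts: run_directory is a hypothetical helper, and in the real scripts the volume root would be /vol, with the volume commit happening after training as noted above.

```python
import os
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def run_directory(volume_root: str, run_id: str) -> Iterator[Path]:
    """Create <volume_root>/runs/<run_id>, chdir into it, and restore cwd on exit."""
    run_dir = Path(volume_root) / "runs" / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    previous = os.getcwd()
    os.chdir(run_dir)
    try:
        # Anything the training code writes with relative paths lands here.
        yield run_dir
    finally:
        os.chdir(previous)
```

Used as `with run_directory("/vol", run_id): ...`, relative outputs such as logs/<run_id>.txt and final_model.* end up under runs/<run_id>/ on the volume rather than on the container's ephemeral disk.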
Verify the volume (path defaults to / on the volume — you should see datasets/, tokenizers/, and after a good run runs/):
.venv-modal/bin/modal volume ls parameter-golf-data
.venv-modal/bin/modal volume ls parameter-golf-data runs

Download to your machine:
mkdir -p modal_runs
.venv-modal/bin/modal volume get parameter-golf-data runs/<run_id> ./modal_runs/<run_id>

If volume get says No such file or directory: the run likely used an older Modal definition that wrote logs and checkpoints under /root (container disk). That data is not on parameter-golf-data and is gone after the container exits. Recover by copying function logs / stdout from the Modal app run page (e.g. Go to function logs). Then git pull and re-run with the current scripts/modal_train_lawa.py (uses /vol/runs/<run_id> + chdir + commit); the log line [modal] volume path '/vol/runs/…' after train: [...] confirms files landed on the volume.
Then bundle from that directory:
./scripts/bundle_run_results.sh "<run_id>" "./modal_runs/<run_id>_bundle.tar.gz" "./modal_runs/<run_id>"

When you have GCP / Azure credits:
- VM: NVIDIA GPU + CUDA driver; use the same Python / PyTorch / deps stack as RunPod eval (see requirements.txt and upstream challenge docs).
- Disk: attach a persistent data disk for the repo + data/ (like a RunPod network volume).
- Results: same bundle_run_results.sh + optional make_run_meta.sh / extract_run_metrics.py, then gsutil cp / az storage blob upload (or SCP).
Add concrete machine images and commands here once you standardize on a SKU.
| Symptom | Likely cause |
|---|---|
| git: not found during Modal image build | Only if you still git clone in the image; LAWA/KURE scripts use prebuilt FA3 wheels instead. |
| Modal build very slow once | Usually large pip layers or a mismatched FA3 wheel index (pip falls back to source build). Match FA3_WHEEL_FIND_LINKS to PYTORCH_FA3_BASE in scripts/modal_image_fa3_pytorch.py per windreamer. Cached per workspace; see Modal images. |
| externally-managed-environment during Modal image build | PyTorch Hub image: use uv_pip_install (this repo does that in modal_image_fa3_pytorch.py), not bare pip_install on that base. |
| SCP fails on RunPod | Use runpodctl send/receive or full SSH; see links above. |
| Empty folder after Modal run | Ensure training finished; check modal volume ls parameter-golf-data runs/. Volume writes need commit() (handled in our scripts). |
| modal volume get → No such file or directory for runs/<id> | runs/ was never created: old deploy wrote to /root only (ephemeral). List root: modal volume ls parameter-golf-data. Recover from Modal UI logs; redeploy latest scripts. |
| Not found: "/vol/tokenizers/fineweb_1024_bpe.model" (SentencePiece) | Volume never seeded for this Modal account. Run ./scripts/modal_sync_data.sh from the repo (needs local data/datasets/fineweb10B_sp1024 + data/tokenizers/fineweb_1024_bpe.model). |
| Timed out waiting for final app logs (local CLI) | Often harmless; remote job may still finish. Confirm in the Modal app run page. |