Add runpod_record_attempt.sh to automate multi-GPU, multi-seed SOTA run #4

Merged
teslaeco merged 1 commit into main from codex/task-title-xob60q on Apr 18, 2026

@teslaeco
Member

Motivation

  • Provide a one-shot, reproducible script that runs the official SOTA training record attempt on a fresh RunPod instance, with consistent environment setup for multi-GPU H100 runs.
  • Install dependencies (with a graceful fallback for optional packages) and run the three specified seeds with clear, timestamped logging.

Description

  • Add runpod_record_attempt.sh at the repository root that clones/syncs the openai/parameter-golf repository into a work directory and checks out main as needed.
  • Install Python dependencies via pip (torch, sentencepiece) and attempt to install flash_attn, treating a failed install as non-fatal.
  • Configure common multi-GPU environment variables (e.g., CUDA_VISIBLE_DEVICES, NCCL_DEBUG, OMP_NUM_THREADS) for an 8x H100 setup and detect available CUDA devices using torch.cuda.device_count().
  • Validate that the target record directory and train_gpt.py exist, then launch training with torchrun --standalone --nproc_per_node=${NUM_GPUS} for seeds 42, 314, and 999. Each run is teed to runpod_seed<seed>.log, and timestamped progress is printed to stdout.
  • Mark the script as executable.
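
The steps above can be sketched roughly as follows. This is a minimal illustration, not the merged script: the function names, the `REPO_URL` value, the overridable `LAUNCHER` array, and passing the seed through a `SEED` environment variable are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Illustrative sketch only: function names, REPO_URL, LAUNCHER, and the SEED
# environment variable are assumptions, not taken from the merged script.
set -euo pipefail

REPO_URL="https://github.com/openai/parameter-golf.git"  # assumed clone URL
SEEDS=(42 314 999)
LAUNCHER=(torchrun)  # overridable so the loop can be exercised without GPUs

# Print a message prefixed with a UTC timestamp.
log() {
  printf '[%s] %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$*"
}

# Clone the repository into work_dir if absent, otherwise fast-forward main.
sync_repo() {
  local work_dir="$1"
  if [[ -d "${work_dir}/.git" ]]; then
    git -C "${work_dir}" fetch origin
    git -C "${work_dir}" checkout main
    git -C "${work_dir}" pull --ff-only origin main
  else
    git clone "${REPO_URL}" "${work_dir}"
  fi
}

# Install an optional package; a failed install is logged, not fatal.
install_optional() {
  local pkg="$1"
  if ! python3 -m pip install "${pkg}"; then
    log "optional package ${pkg} failed to install; continuing without it"
  fi
}

# Report the CUDA device count via torch, or 0 if python3/torch is unavailable.
detect_num_gpus() {
  python3 -c 'import torch; print(torch.cuda.device_count())' 2>/dev/null || echo 0
}

# Launch one run per seed, teeing each run's output to runpod_seed<seed>.log.
run_seeds() {
  local num_gpus="$1" train_script="$2" seed
  for seed in "${SEEDS[@]}"; do
    log "starting seed ${seed} on ${num_gpus} GPU(s)"
    SEED="${seed}" "${LAUNCHER[@]}" --standalone \
      --nproc_per_node="${num_gpus}" "${train_script}" 2>&1 \
      | tee "runpod_seed${seed}.log"
    log "finished seed ${seed}"
  done
}
```

A caller might run `sync_repo`, then `install_optional flash_attn`, then `run_seeds "$(detect_num_gpus)" train_gpt.py` from inside the record directory. Because `pipefail` is set, a failed launch stops the loop even though its output is piped through `tee`.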

Testing

  • bash -n runpod_record_attempt.sh was run to validate shell syntax, and it passed.
  • stat -c '%A %n' runpod_record_attempt.sh was run to verify the executable bit and returned -rwxr-xr-x, confirming the script is executable.
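
The two checks above could be combined into a small helper for reuse. This is a sketch under assumptions: `check_script` is a hypothetical name, and it uses the shell's `-x` test instead of `stat -c`, since the `-c` format option is specific to GNU stat.

```shell
#!/usr/bin/env bash
# Illustrative sketch; check_script is a hypothetical helper, not part of the PR.

# Return 0 only if the script parses cleanly and has its executable bit set.
check_script() {
  local script="$1"
  if ! bash -n "${script}"; then
    echo "syntax check failed: ${script}" >&2
    return 1
  fi
  if [[ ! -x "${script}" ]]; then
    echo "not executable: ${script}" >&2
    return 1
  fi
  echo "ok: ${script}"
}
```

Calling `check_script runpod_record_attempt.sh` would then cover both checks in one step. Note that `bash -n` only parses the file; it does not execute it, so runtime failures such as a missing `torchrun` are not caught.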

@teslaeco merged commit 898f566 into main on Apr 18, 2026