Train and ship a PPO agent for CartPole using Stable-Baselines3. The project uses uv for fast, reproducible environments and includes ready-to-run experiment variants, TensorBoard monitoring, ONNX export, and seamless ProtoTwin integration for inference.
It's designed to take you from a naive baseline to industry-standard practice with structured experiments, reproducible training, model packaging, and deployment. You can train on ProtoTwin, export to ONNX, and run inference on ProtoTwin by writing simple control logic in TypeScript.
Quickstart • Structure • Training • Monitoring • ONNX • Troubleshooting • Roadmap
| Area | Capability |
|---|---|
| Algorithms | PPO (easily extensible to A2C, DQN, etc.) |
| Experimentation | Versioned training entrypoints: main-v1.py, main-v2.py, ... |
| Monitoring | TensorBoard logs per variant: tensorboard-v1/, tensorboard-v2/ |
| Export | ONNX conversion via export_onnx.py |
| Deployment | Ready for ProtoTwin (upload ONNX or Python policy) |
| Extensibility | Clean project layout; add new envs or models fast |
| Dev UX | Minimal commands to get started |
Important
- Notes PDF: Important understandings
- Curated Resource: TLDRAW board
- Python 3.10+
- ProtoTwin Connect for training and deployment
- NVIDIA GPU + CUDA-capable PyTorch build (Optional)
Note
The project works fine on CPU; a GPU is not required.
```bash
git clone https://github.com/amugoodbad229/CartPoleRL.git
cd CartPoleRL
```
Install uv (one time):
```bash
# macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
winget install --id=astral-sh.uv -e

# Check installation
uv --version
```
Sync environment + dependencies:
```bash
uv sync
```
Activate environment:
```bash
# Linux/macOS
source .venv/bin/activate

# Windows PowerShell
.venv\Scripts\Activate.ps1
```
Run a training variant:
```bash
ls main-v*.py       # discover available variants
python main-v1.py   # or main-v2.py, etc.

# OPTIONAL: For custom CLI commands
python main-v1.py --num_envs 32 --initial_lr 0.001 --num_timesteps 500000
```
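How those flags are consumed depends on each variant script; below is a minimal sketch of how `--num_envs`, `--initial_lr`, and `--num_timesteps` could be parsed and passed into PPO. Everything beyond the flag names shown above (defaults, save path) is an assumption.

```python
import argparse

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

parser = argparse.ArgumentParser()
parser.add_argument("--num_envs", type=int, default=8)            # parallel CartPole instances
parser.add_argument("--initial_lr", type=float, default=3e-4)     # PPO learning rate
parser.add_argument("--num_timesteps", type=int, default=200_000) # total training steps
args = parser.parse_args()

# Vectorized CartPole environments and a PPO agent wired to the CLI flags
env = make_vec_env("CartPole-v1", n_envs=args.num_envs)
model = PPO("MlpPolicy", env, learning_rate=args.initial_lr)
model.learn(total_timesteps=args.num_timesteps)
```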
Launch TensorBoard (choose the appropriate variant path):
```bash
python -m tensorboard.main --logdir tensorboard-v1
```
Or to watch all:
```bash
python -m tensorboard.main --logdir .
```
Tip
If nothing appears, ensure training produced events:
```bash
find tensorboard-v1 -type f -name "*tfevent*"
```
Each main-vX.py file encapsulates a slightly different configuration:
- Hyperparameters (learning rate, gamma, entropy coefficient)
- Network architecture (default or custom)
- Logging folder (Agent Models)
- Callback setup
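For illustration, a variant might differ only in a few PPO constructor arguments. The sketch below is hypothetical: the values, `net_arch`, and paths are assumptions, not the repo's actual settings.

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

env = make_vec_env("CartPole-v1", n_envs=8)

model = PPO(
    "MlpPolicy",
    env,
    learning_rate=1e-3,                       # hyperparameter tuned per variant
    gamma=0.99,                               # discount factor
    ent_coef=0.01,                            # entropy bonus
    policy_kwargs=dict(net_arch=[128, 128]),  # network architecture (default or custom)
    tensorboard_log="tensorboard-v2",         # per-variant TensorBoard folder
    verbose=1,
)
model.learn(total_timesteps=300_000)
model.save("logs-v2/checkpoints/ppo_cartpole")  # per-variant checkpoint folder
```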
Tip
Duplicate an existing file to create a new experiment:
`cp main-v1.py main-v3.py`, then edit the run name, log path, and hyperparameters.
| Variant | Purpose |
|---|---|
| `main-v0.py` | Baseline PPO |
| `main-v1.py` | Tuned learning rate / entropy |
| `main-v2.py` | Different network width |
| `main-v3.py` | Longer training horizon |
| `main-vN.py` | Custom experiment |
Generate an ONNX policy (after training):
```bash
python export_onnx.py
```
Note
If the script uses hardcoded paths, edit export_onnx.py or extend it with argparse.
ProtoTwin usage: Upload the ONNX file to ProtoTwin or deploy the Python inference.
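If you need to adapt or rewrite the export step, the pattern below follows the Stable-Baselines3 documentation's ONNX export recipe. The checkpoint path and output filename are placeholders; adjust them to your variant.

```python
import torch
from stable_baselines3 import PPO


class OnnxablePolicy(torch.nn.Module):
    """Wraps an SB3 actor-critic policy so torch.onnx.export can trace it."""

    def __init__(self, policy):
        super().__init__()
        self.policy = policy

    def forward(self, observation):
        # Deterministic forward pass returning (actions, values, log_probs)
        return self.policy(observation, deterministic=True)


# Placeholder checkpoint path; point this at your trained variant
model = PPO.load("logs-v1/checkpoints/ppo_cartpole", device="cpu")
dummy_obs = torch.randn(1, *model.observation_space.shape)  # CartPole: shape (1, 4)

torch.onnx.export(
    OnnxablePolicy(model.policy),
    dummy_obs,
    "ppo_cartpole.onnx",
    input_names=["observation"],
    output_names=["action", "value", "log_prob"],
    opset_version=17,
)
```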
```
.
├── main-v1.py        # Training variant 1
├── main-v2.py        # Training variant 2 (extend as needed)
├── export_onnx.py    # Convert trained model to ONNX
├── logs-v1/          # Training logs + checkpoints (variant 1)
│   └── checkpoints/
├── tensorboard-v1/   # TensorBoard event files (variant 1)
├── pyproject.toml    # Project + dependency definitions
├── uv.lock           # Locked, reproducible dependency set
└── README.md
```
Note
Additional variants (e.g., logs-v2/, tensorboard-v2/) appear after running those scripts.
| Task | How |
|---|---|
| Add a new algorithm | Replace PPO import with another SB3 algorithm |
| Add custom policy | Define policy_kwargs in the training script |
| Change environment | Swap CartPole-v1 with another Gymnasium env |
| Add callbacks | Implement BaseCallback and register in training |
| Log extra metrics | Use custom callback + self.logger.record() |
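As an example of the last two rows, here is a minimal custom callback (the metric name is hypothetical) that records an extra scalar into the same TensorBoard run:

```python
from stable_baselines3.common.callbacks import BaseCallback


class MeanEpisodeLengthCallback(BaseCallback):
    """Records the rolling mean episode length at the end of each rollout."""

    def _on_rollout_end(self) -> None:
        ep_infos = self.model.ep_info_buffer  # deque of {"r": reward, "l": length, ...}
        if ep_infos:
            mean_len = sum(info["l"] for info in ep_infos) / len(ep_infos)
            self.logger.record("custom/mean_episode_length", mean_len)

    def _on_step(self) -> bool:
        # Required by BaseCallback; returning True keeps training running
        return True


# Usage: model.learn(total_timesteps=100_000, callback=MeanEpisodeLengthCallback())
```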
Add (or use) a snippet like:
```python
from stable_baselines3.common.evaluation import evaluate_policy

mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
print(f"Mean: {mean_reward:.2f} ± {std_reward:.2f}")
```
Commit your experiment changes:
```bash
git status
git add .
git commit -m "Experiment: tuned lr and entropy"
git push origin main
```
Tip
Use branches for big experiments:
```bash
git checkout -b feat/entropy-sweep
```
| Symptom | Fix |
|---|---|
| `uv: command not found` | Reinstall uv, restart terminal |
| No TensorBoard data | Confirm correct tensorboard-vX/ path |
| CPU instead of GPU | Check: python -c "import torch; print(torch.cuda.is_available())" |
| ImportError (SB3) | Run uv sync again (env might be stale) |
| Permission denied on activate | On Unix: chmod +x .venv/bin/activate (rare) |
Caution
Paths are case-sensitive. Use `cd CartPoleRL`, not `cd cartpolerl`.
- Add evaluation script (e.g., `evaluate.py`)
- Hyperparameter sweeps integration (Optuna / WandB)
- Dockerfile for containerized deployment
- Unified config system (`config/` + YAML)
- CI workflow (lint + format + smoke test)
- Fork the repo
- Create a feature branch: `git checkout -b feat/new-idea`
- Submit a PR with: description, metrics, rationale
MIT License © 2025 Ayman Khan
If this helps you learn or prototype faster:
- Star the repo
- Share feedback
- Open issues for improvements
Happy balancing!