
πŸ‹οΈβ€β™‚οΈ CartPoleRL

Train and ship a PPO agent for CartPole using Stable-Baselines3. The project uses uv for fast, reproducible environments and includes ready-to-run experiment variants, TensorBoard monitoring, ONNX export, and seamless ProtoTwin integration for inference.

It’s designed to take you from a naive baseline to industry‑standard practice with structured experiments, reproducible training, model packaging, and deployment. You can train on ProtoTwin, export to ONNX, and run inference on ProtoTwin by writing simple control logic in TypeScript.

Quickstart • Structure • Training • Monitoring • ONNX • Troubleshooting • Roadmap


✨ Features

| Area | Capability |
| --- | --- |
| Algorithms | PPO (easily extensible to A2C, DQN, etc.) |
| Experimentation | Versioned training entrypoints: main-v1.py, main-v2.py, ... |
| Monitoring | TensorBoard logs per variant: tensorboard-v1/, tensorboard-v2/ |
| Export | ONNX conversion via export_onnx.py |
| Deployment | Ready for ProtoTwin (upload ONNX or Python policy) |
| Extensibility | Clean project layout – add new envs or models fast |
| Dev UX | Minimal commands to get started |

📚 Important Links

Important

• Notes PDF: Important understandings
• Curated Resource: TLDRAW board


✅ Prerequisites

  • Python 3.10+
  • ProtoTwin Connect for training and deployment
  • NVIDIA GPU + CUDA-capable PyTorch build (Optional)

Note

The project runs fine on CPU; a GPU is not required.


⚡ Quickstart

git clone https://github.com/amugoodbad229/CartPoleRL.git
cd CartPoleRL

Install uv (one time):

# macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
winget install --id=astral-sh.uv -e

# Check installation
uv --version

Sync environment + dependencies:

uv sync

Activate environment:

# Linux/macOS
source .venv/bin/activate

# Windows PowerShell
.venv\Scripts\Activate.ps1

Run a training variant:

ls main-v*.py          # discover available variants
python main-v1.py      # or main-v2.py, etc.

# OPTIONAL: For custom CLI commands
python main-v1.py --num_envs 32 --initial_lr 0.001 --num_timesteps 500000
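
The exact flags depend on how each variant defines its parser. As a hedged illustration only (the real scripts may differ), the flags shown above could be wired with argparse like this:

import argparse

# Hypothetical sketch: flag names mirror the README example above; defaults are illustrative.
parser = argparse.ArgumentParser(description="PPO training variant")
parser.add_argument("--num_envs", type=int, default=8)         # parallel environments
parser.add_argument("--initial_lr", type=float, default=3e-4)  # starting learning rate
parser.add_argument("--num_timesteps", type=int, default=200_000)
args = parser.parse_args()
# args.num_envs, args.initial_lr and args.num_timesteps would then feed
# make_vec_env(...) and PPO(...).learn(...) inside the script.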

📊 Monitoring

Launch TensorBoard (choose the appropriate variant path):

python -m tensorboard.main --logdir tensorboard-v1

Or to watch all:

python -m tensorboard.main --logdir .

Tip

If nothing appears, ensure training produced events:
find tensorboard-v1 -type f -name "*tfevent*"


🧪 Training & Experiment Variants

Each main-vX.py file encapsulates a slightly different configuration (a minimal sketch follows the list):

  • Hyperparameters (learning rate, gamma, entropy coefficient)
  • Network architecture (default or custom)
  • Logging folder (where agent models / checkpoints are written)
  • Callback setup
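
A minimal sketch of what a new variant (say, a hypothetical main-v3.py) might contain, assuming the scripts follow the standard Stable-Baselines3 pattern; folder names and values are illustrative:

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.callbacks import CheckpointCallback

# Vectorized CartPole environments (the number of envs is one knob to vary)
env = make_vec_env("CartPole-v1", n_envs=8)

model = PPO(
    "MlpPolicy",
    env,
    learning_rate=3e-4,                      # hyperparameter knobs
    gamma=0.99,
    ent_coef=0.01,
    policy_kwargs=dict(net_arch=[64, 64]),   # network architecture
    tensorboard_log="tensorboard-v3",        # per-variant TensorBoard folder
    verbose=1,
)

# Checkpoints land in the per-variant logs folder
checkpoint_cb = CheckpointCallback(save_freq=10_000, save_path="logs-v3/checkpoints")
model.learn(total_timesteps=200_000, callback=checkpoint_cb)
model.save("logs-v3/ppo_cartpole_v3")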

Tip

Duplicate an existing file to create a new experiment:
cp main-v1.py main-v3.py → edit run name, log path, and hyperparameters.

Suggested Naming Convention

| Variant | Purpose |
| --- | --- |
| main-v0.py | Baseline PPO |
| main-v1.py | Tuned learning rate / entropy |
| main-v2.py | Different network width |
| main-v3.py | Longer training horizon |
| main-vN.py | Custom experiment |

📦 ONNX Export & Deployment

Generate an ONNX policy (after training):

python export_onnx.py

Note

If the script uses hardcoded paths, edit export_onnx.py or extend it with argparse.

ProtoTwin usage: Upload the ONNX file to ProtoTwin or deploy the Python inference.
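
If you need to adapt or rewrite the export step, here is a minimal sketch of SB3-to-ONNX conversion. The checkpoint path and output names are illustrative, not necessarily what export_onnx.py does:

import torch
from stable_baselines3 import PPO

# Hypothetical checkpoint path; point this at your trained model.
model = PPO.load("logs-v1/checkpoints/ppo_cartpole", device="cpu")

class OnnxablePolicy(torch.nn.Module):
    """Wraps the SB3 policy so ONNX export traces the action path."""
    def __init__(self, policy):
        super().__init__()
        self.policy = policy

    def forward(self, observation):
        # ActorCriticPolicy returns (actions, values, log_prob)
        return self.policy(observation, deterministic=True)

dummy_obs = torch.randn(1, *model.observation_space.shape)  # CartPole: (1, 4)
torch.onnx.export(
    OnnxablePolicy(model.policy),
    dummy_obs,
    "cartpole_policy.onnx",
    opset_version=17,
    input_names=["observation"],
    output_names=["action", "value", "log_prob"],
)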


🧱 Project Structure

.
├── main-v1.py             # Training variant 1
├── main-v2.py             # Training variant 2 (extend as needed)
├── export_onnx.py         # Convert trained model to ONNX
├── logs-v1/               # Training logs + checkpoints (variant 1)
│   └── checkpoints/
├── tensorboard-v1/        # TensorBoard event files (variant 1)
├── pyproject.toml         # Project + dependency definitions
├── uv.lock                # Locked, reproducible dependency set
└── README.md

Note

Additional variants (e.g., logs-v2/, tensorboard-v2/) appear after running those scripts.


🔧 Extending the Project

| Task | How |
| --- | --- |
| Add a new algorithm | Replace the PPO import with another SB3 algorithm |
| Add custom policy | Define policy_kwargs in the training script |
| Change environment | Swap CartPole-v1 for another Gymnasium env |
| Add callbacks | Implement BaseCallback and register it in training (see sketch below) |
| Log extra metrics | Use a custom callback + self.logger.record() |
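
For the last two rows, a minimal callback sketch (the class and metric names are illustrative):

from stable_baselines3.common.callbacks import BaseCallback

class ExtraMetricsCallback(BaseCallback):
    """Hypothetical example: log an extra scalar alongside SB3's built-in metrics."""

    def _on_step(self) -> bool:
        if self.n_calls % 1_000 == 0:
            # Written to the same TensorBoard run as the built-in PPO metrics
            self.logger.record("custom/num_timesteps", self.num_timesteps)
        return True  # returning False would stop training early

# Register it when training:
# model.learn(total_timesteps=200_000, callback=ExtraMetricsCallback())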

🧪 Evaluating a Policy

Add (or use) a snippet like:

from stable_baselines3.common.evaluation import evaluate_policy
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
print(f"Mean: {mean_reward:.2f} Β± {std_reward:.2f}")

🛠 Useful Git Commands

git status
git add .
git commit -m "Experiment: tuned lr and entropy"
git push origin main

Tip

Use branches for big experiments:
git checkout -b feat/entropy-sweep


🚑 Troubleshooting

| Symptom | Fix |
| --- | --- |
| uv: command not found | Reinstall uv, then restart the terminal |
| No TensorBoard data | Confirm the correct tensorboard-vX/ path |
| CPU instead of GPU | Check: python -c "import torch; print(torch.cuda.is_available())" |
| ImportError (SB3) | Run uv sync again (the env might be stale) |
| Permission denied on activate | On Unix: chmod +x .venv/bin/activate (rare) |

Caution

Paths are case-sensitive. Use cd CartPoleRL, not cd cartpolerl.


🧭 Roadmap

  • Add evaluation script (e.g., evaluate.py)
  • Hyperparameter sweeps integration (Optuna / WandB)
  • Dockerfile for containerized deployment
  • Unified config system (config/ + YAML)
  • CI workflow (lint + format + smoke test)

🤝 Contributing

  1. Fork the repo
  2. Create a feature branch: git checkout -b feat/new-idea
  3. Submit PR with: description, metrics, rationale

🙏 Acknowledgements


📄 License

MIT License © 2025 Ayman Khan

⭐ Support

If this helps you learn or prototype faster:

  • Star the repo
  • Share feedback
  • Open issues for improvements

Happy balancing! 🛠️🧠
