A hands-on modern reinforcement learning course
A practice-first guide to modern RL, from classic control to LLM post-training, RLVR, and multimodal agents.
Course Preview · Overview · News · Contents · Course Outline · Experiment Code · Quick Start · Contributing
Note
We hope this open course gives more learners the courage to climb toward the frontier of intelligence and solve more of the hard problems on the path to AGI.
The course is evolving quickly. We recommend focusing on chapters that are not marked as under construction; chapters still in progress may contain mistakes, and corrections or suggestions are welcome.
Warning
The LLM RL and Agentic RL sections have not yet been fully reviewed or corrected. Please read them with appropriate caution.
Help Wanted
Because compute resources are limited, we are seeking GPU support. If you can help with GPU access, please contact physicoada@gmail.com.
- Course Preview
- Contents
- Overview
- News
- Roadmap
- Course Outline
- Experiment Code
- Recommended Learning Path
- Quick Start
- Repository Structure
- Development Commands
- Contributing
- Star History
- Other Courses
- WeChat Group (微信)
- Citation
- License
Hands-On Modern RL is an open course for learning modern reinforcement learning through practice. Instead of the usual "formula first, black-box API later" route, this course takes a practice-first path: learners begin with runnable code and observable training behavior, then use those concrete traces to understand states, value functions, policy gradients, reward modeling, credit assignment, and the rest of the mathematical structure behind RL.
The course spans classic control and connects directly to current AI frontiers, including large language model (LLM) post-training, preference alignment with DPO and GRPO, reinforcement learning with verifiable rewards (RLVR), multi-turn tool-use agents, Agentic RL, and vision-language model (VLM) reinforcement learning.
The goal is to provide a solid ladder: from solving CartPole for the first time to building modern post-training and agent systems.
The course is organized around these engineering and teaching principles:
- Practice before formalism. Each major topic starts from experiments, metrics, failure cases, or implementation details, then introduces the mathematical abstraction.
- Theory explains behavior. MDPs, Bellman equations, policy gradients, GAE, PPO clipping, DPO objectives, and GRPO-style group advantages are introduced as tools for explaining what the code does (see the small sketch after this list).
- Modern RL goes beyond classic RL. The course covers classic control and deep RL, then moves into RLHF, preference optimization, RLVR, VLM reinforcement learning, and multi-turn agent training.
- Debugging is first-class. Training collapse, reward hacking, KL drift, entropy decay, OOM failures, and evaluation blind spots are treated as core material.
- Readable systems beat black boxes. Examples favor explicit implementations, inspectable metrics, and clear experiment boundaries so learners can modify and extend them.
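As a taste of the "theory explains behavior" principle, the PPO clipping and GRPO-style group advantages mentioned above reduce to a few lines of tensor code. The sketch below is a minimal illustration in PyTorch with placeholder tensor names (`logp_new`, `logp_old`, `advantages`, `rewards`); it is not the course's reference implementation.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    # Importance ratio between the new and old policies for the sampled actions.
    ratio = torch.exp(logp_new - logp_old)
    # Clipping caps how far a single update can push the ratio away from 1.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negate because optimizers minimize; PPO maximizes the clipped surrogate.
    return -torch.min(unclipped, clipped).mean()

def grpo_group_advantages(rewards):
    # GRPO-style group-relative advantage: normalize rewards within one group
    # of responses to the same prompt, replacing a learned critic.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```

Chapters 07 and 09 derive these objectives properly and connect them to the training curves they produce.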
This course is for learners who want to understand reinforcement learning by building and inspecting working systems.
It is especially useful for:
- Machine learning engineers moving from supervised learning into RL.
- Researchers and students preparing to read modern RL and alignment papers.
- LLM practitioners interested in RLHF, DPO, GRPO, RLVR, and post-training systems.
- Builders of tool-use agents, web agents, code agents, and evaluation pipelines.
- Self-learners who prefer code, experiments, and visual intuition before dense derivations.
Recommended background:
- Python programming experience.
- Basic PyTorch familiarity.
- Introductory linear algebra, probability, and calculus for machine learning.
- Ability to read papers and trace open-source training scripts.
The course includes math review appendices, so full mathematical fluency is not required on day one.
After completing the course, learners should be able to:
- Implement and explain the core RL loop: environment interaction, trajectory collection, reward feedback, policy updates, and evaluation (a minimal sketch of this loop appears after this list).
- Connect MDPs, value functions, Bellman equations, TD learning, policy gradients, and advantage estimation to concrete training behavior.
- Read and modify implementations of DQN, REINFORCE, Actor-Critic, PPO, DPO, GRPO, and related methods.
- Reason about LLM post-training pipelines, including SFT, reward modeling, PPO-style RLHF, DPO-family methods, and RLVR training.
- Understand multi-turn interaction and credit assignment, and build tool-use, trajectory-synthesis, and Agentic RL systems.
- Extend reinforcement learning ideas to VLMs, embodied intelligence, multi-agent self-play, and other frontier areas.
- Diagnose common RL failure modes and design reasonable algorithms, engineering evaluations, and debugging workflows for new RL problems.
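As a concrete anchor for the first outcome above, here is a minimal sketch of that core loop on Gymnasium's CartPole with a random policy. It assumes `gymnasium` is installed and only stands in for the fuller examples under `code/chapter01_cartpole/`.

```python
import gymnasium as gym

# The skeleton that DQN, REINFORCE, and PPO training all build on:
# interact with the environment, collect rewards, and (eventually) update a policy.
env = gym.make("CartPole-v1")
for episode in range(5):
    obs, info = env.reset()
    done, episode_return = False, 0.0
    while not done:
        action = env.action_space.sample()  # random policy; a learned policy goes here
        obs, reward, terminated, truncated, info = env.step(action)
        episode_return += reward
        done = terminated or truncated
    print(f"episode {episode}: return = {episode_return}")
env.close()
```

Chapter 01 replaces the random policy with a trained one and reads the resulting reward, entropy, and value-loss curves.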
This repository is an active courseware project. Content is being expanded chapter by chapter, with emphasis on correctness, runnable examples, and a stable learning path.
- Course site: walkinglabs.github.io/hands-on-modern-rl
- Source content: `docs/`
- Runnable examples: `code/`
- Local verification: `npm run verify`
- License: CC BY-NC-SA 4.0
Issues and pull requests are welcome for typo fixes, conceptual corrections, reproducibility improvements, references, and focused course extensions.
Note: This course was created with AI assistance and has not yet been fully reviewed. It may contain factual mistakes or code that does not run as expected. Issues and pull requests are very welcome.
- [2026-05-02] Initial browsable open-source release for testing and feedback.
The course is under active development. Planned milestones:
- 2026-05-02: Initial open-source browsable release for community testing and feedback.
- 2026-05-10: Publish a first stable minor version, fix early typos, and stabilize Part 1 and Part 2 content and code.
- Late May 2026: Improve reproducible LLM RL experiments and add a full RLVR hands-on module with evaluation.
- Early June 2026: Deliver Agentic RL projects step by step, from single-tool use to complex Deep Research trajectory synthesis.
- Late June 2026: Add Unity-based embodied RL environments and trainable project examples.
- July 2026 and later: Expand multimodal frontier content with full VLM RL or Diffusion RL hands-on cases.
The course is divided into four parts plus appendices. The online site includes full text, diagrams, code references, and chapter navigation.
| Topic | Description |
|---|---|
| Course Guide | Course positioning, learning path, and how to use the materials. |
| A Brief History of Reinforcement Learning | From trial-and-error learning to AlphaGo, RLHF, and LLM alignment. |
| Environment Setup | Installation and dependency setup for the course. |
| Chapter | Topic | Core Question |
|---|---|---|
| 01 | CartPole | In a real environment, what are states, actions, rewards, policies, values, entropy, and training curves? |
| 1.1 | States, Actions, Rewards, and Policies | What basic objects make up an RL problem? |
| 1.2 | Reward, Entropy, Value Loss, and KL | What do the key training-curve metrics tell us? |
| 02 | DPO Preference Fine-tuning | How does preference optimization change model behavior, and what do loss, reward margin, and accuracy mean? |
| 2.1 | Post-Training Pipeline and DPO Derivation | How does DPO derive a training objective from preference data and a reference model? |
| 2.2 | Loss, Reward Margin, and Accuracy | How should DPO training metrics be interpreted? |
| Summary | Part 1 Summary | What intuition should be in place before formal theory? |
| Chapter | Topic | Core Question |
|---|---|---|
| 03 | MDPs and Value Functions | How do bandits, MDPs, value functions, Bellman equations, and TD error formalize sequential decision-making? |
| 3.1 | Two-Armed Bandit Problem | How does the simplest trial-and-error problem show exploration and exploitation? |
| 3.2 | Markov Decision Processes | How do states, actions, transitions, rewards, and discounting define a sequential decision model? |
| 3.3 | Value Functions and Bellman Equations | How can a value function recursively evaluate a situation? |
| 3.4 | DP, MC, and TD | How do dynamic programming, Monte Carlo, and temporal-difference learning estimate value? |
| 3.5 | From Q to Q-Learning | How does action value turn "is this state good?" into "which action should I choose?" |
| 3.6 | From Value to Policy | When directly optimizing a policy, what exactly does the objective maximize? |
| 3.7 | Where Data Comes From | How do on-policy, off-policy, and data sources affect algorithm design? |
| 3.8 | Reward Function Design | How can reward functions guide learning, and how can they be misused? |
| 3.9 | Chapter Summary | How do the MDP chapter concepts connect into an algorithm map? |
| 04 | Deep Q-Networks | Why are replay buffers, target networks, CNN encoders, and DQN variants important? |
| 4.1 | Why Deep Q-Networks Are Needed | How do neural networks replace tables for approximating Q functions? |
| 4.2 | The Three Components of DQN | What stability problems do replay, target networks, and encoders solve? |
| 4.3 | LunarLander Training Analysis | What do DQN training curves and Q-value changes reveal? |
| 4.4 | LunarLander Hands-On | How does DQN land on a richer control task, and how should it be tuned? |
| 4.5 | The Deep Q-Network Family | How did the DQN family fix overestimation, representation, and sampling issues? |
| 4.6 | Visual Game Projects | What engineering changes are needed when moving from low-dimensional control to visual games? |
| 05 | Policy Gradient and REINFORCE | How can policies be optimized directly, and why do baselines reduce gradient variance? |
| 5.1 | Hands-On: Dice Gambling Bandit | How does a minimal experiment reveal policy-gradient sampling updates? |
| 5.2 | Policy Gradient and REINFORCE | How does REINFORCE increase the probability of high-return actions? |
| 5.3 | Hands-On: Baseline Variance Reduction | Why does a baseline reduce variance without changing the expectation? |
| 06 | Actor-Critic | How do actor and critic split the learning problem, and how does TD error become an advantage signal? |
| 6.1 | Advantage Function | How does advantage answer "how much better was this action than average?" |
| 6.2 | Training the Critic with TD Error | How does the critic learn value estimates from bootstrapped signals? |
| 6.3 | Actor-Critic Architecture | How do actor and critic work together in one training loop? |
| 6.4 | Project: A Simple AlphaGo Reproduction | How do policy networks, value networks, and search combine into a game-playing agent? |
| 07 | PPO | How do clipping, trust-region intuition, GAE, and reward models stabilize policy optimization? |
| 7.1 | Hands-On: PPO on LunarLander | How does PPO behave on a more complex control task, and how should it be tuned? |
| 7.2 | PPO Math Derivation | How does the PPO objective move from policy gradient to clipped surrogate objective? |
| 7.3 | Trust Regions and Clipping | How does clipping limit policy update size? |
| 7.4 | GAE and Reward Models | How does GAE balance bias and variance, and how does it connect to reward model training? |
| Summary | Part 2 Summary | What algorithmic patterns repeat across classic and modern RL? |
| Chapter | Topic | Core Question |
|---|---|---|
| 08 | The Full RLHF Pipeline | How do instruction data, reward models, PPO training, evaluation, and scaling fit together? |
| 8.1 | From Model to Assistant | What is the gap between a pretrained model and an assistant model? |
| 8.2 | The RLHF Pipeline | How do SFT, RM, and RL connect as three training stages? |
| 8.3 | Instruction Fine-Tuning | How does supervised fine-tuning build basic instruction-following ability? |
| 8.4 | Reward Models | How does a reward model turn human preferences into an optimizable signal? |
| 8.5 | PPO Fine-Tuning | How does PPO optimize a language model under a KL constraint? |
| 8.6 | Evaluating Improvement | How can we tell whether alignment training improved the model? |
| 8.7 | Scaling to Large Models | What engineering problems appear when the same RLHF pipeline is scaled up? |
| 8.8 | Reward Hacking | How can reward gaming be detected, and how can data iteration keep improving the model? |
| 09 | Post-Training Alignment | How do DPO, GRPO, DeepSeek-R1, and verifiable rewards train reasoning behavior? |
| 9.1 | Preference Optimization Methods | How does the preference-optimization family bypass explicit reward models? |
| 9.2 | DPO Experiment | How can a DPO training experiment be run and inspected end to end? |
| 9.3 | GRPO | How does GRPO replace a critic with within-group relative advantage? |
| 9.4 | R1 and DAPO | What new RL lessons appear in reasoning-model training? |
| 9.5 | Verifiable Rewards | How can rule-checkable tasks provide stable rewards for RL? |
| 9.6 | Policy Distillation | How can online RL behavior be distilled back into a more usable model? |
| 9.7 | Post-Training Practice | How does LLM post-training land in data, rewards, evaluation, and engineering loops? |
| 10 | Agentic RL | How do multi-turn interaction, tool use, trajectory synthesis, and agent systems engineering change RL problems? |
| 10.1 | Multi-Turn Interaction | In multi-step tasks, how can final outcomes be assigned back to intermediate actions? |
| 10.2 | Tool Use | How do tool execution results enter RL trajectories and training data? |
| 10.3 | Evaluation and Cases | What failure modes appear most often in engineering evaluation for Agentic RL? |
| 10.4 | Code Agent | How can a model be trained to switch among search, coding, and testing? |
| 10.5 | Deep Research | How do research agents organize search, citations, and answer-quality rewards? |
| 10.6 | Extended Readings | What should learners read next to go deeper into Agentic RL? |
| Summary | Part 3 Summary | What makes RL for LLMs different from RL in classic environments? |
| Chapter | Topic | Core Question |
|---|---|---|
| 11 | VLM Reinforcement Learning | How do visual rewards, multimodal frameworks, and visual generation RL change the training loop? |
| 11.1 | Training VLMs | How can GRPO training be extended to visual question answering tasks? |
| 11.2 | Visual Rewards | What new problems do multimodal rewards and visual hallucinations introduce? |
| 11.3 | VLM Reasoning Frameworks | How do frontier VLM-RL frameworks organize data, rewards, and training? |
| 11.4 | Visual Generation | How can image generation models be optimized with preferences and rewards? |
| 12 | Future Trends | Where are embodied intelligence, model-based RL, self-play, multi-agent RL, and offline RL going? |
| 12.1 | Embodied Intelligence | How does RL enter robotics and the physical world? |
| 12.2 | Model-Based Reinforcement Learning | How can world models reduce the cost of real environment interaction? |
| 12.3 | Self-Play | How can self-play drive continuous capability improvement? |
| 12.4 | Multi-Agent Systems | How can multiple language agents collaborate, compete, and learn together? |
| 12.5 | Offline Reinforcement Learning | How can a policy be learned from fixed data when online trial and error is unavailable? |
| 12.6 | Scaling Trends | Where might large-scale RL training go next? |
| Summary | Part 4 Summary | What directions should learners follow after finishing the core course? |
| Appendix | Topic | Description |
|---|---|---|
| A | Training Debugging Guide | Failure modes, symptoms, root causes, and fixes for RL training. |
| B | RL Engineering Practice | Training infrastructure, agent sandboxes, evaluation benchmarks, and industrial exercises. |
| B.1 | Training System Foundations | What infrastructure does an RL training system need? |
| B.2 | Agent Sandboxes and Tool Scheduling | How should tool-use agent training isolate execution environments? |
| B.3 | RL and Agent Benchmarks | How should evaluations and bad-case analysis be designed? |
| B.4 | Training Metrics Glossary | What do common training metrics indicate? |
| B.5 | Industrial Practice Exercises | How can engineering concepts be turned into practice tasks? |
| C | Handwritten Code Cheatsheet | Core code notes for SFT, PPO, DPO, GRPO, sampling, attention, and DAPO. |
| C.1 | SFT and KL | How do instruction tuning and KL constraints appear in code? |
| C.2 | PPO and GAE | How can the key PPO and GAE calculations be written by hand? |
| C.3 | The DPO Family | How do DPO-family objectives map to minimal implementations? |
| C.4 | GRPO and Reward Models | How do group advantages and reward signals enter the training loop? |
| C.5 | Softmax and Cross-Entropy | What is the basic code behind classification and language-model losses? |
| C.6 | Sampling Methods | How are generation sampling methods such as top-k and top-p implemented? |
| C.7 | Attention Mechanisms | What are the core tensor transformations in multi-head attention? |
| C.8 | DAPO | How can DAPO's key training tricks become code checkpoints? |
| D | Learning Resources and Reproduction Projects | Curated resources and reproduction projects for expanding course examples. |
| E | Math Foundations for Reinforcement Learning | Linear algebra, probability and statistics, calculus and optimization, and information theory for RL. |
| E.1 | Math Objects and Linear Algebra | How do vectors, matrices, and function approximation support RL representations? |
| E.1.1 | Basic Objects | How do scalars, vectors, matrices, and tensors organize RL data? |
| E.1.2 | Bellman Matrices | How can Bellman equations be written in linear-algebra form? |
| E.1.3 | Function Approximation | How do linear layers and feature representations approximate values or policies? |
| E.1.4 | Convergence and Trust Regions | How do spectra, norms, and approximation error explain stability? |
| E.1.5 | Formulas and Exercises | How can small exercises strengthen linear-algebra tools? |
| E.2 | Probability, Expectation, and Stochastic Estimation | What probability tools do returns, sampling, and trajectory estimation depend on? |
| E.2.1 | Probability Basics | How do random variables, conditional probability, and distributions enter RL? |
| E.2.2 | Returns and Values | Why is a value function fundamentally a conditional expectation? |
| E.2.3 | Sampling Estimation | How can samples estimate expectations and gradients? |
| E.2.4 | Trajectories and GAE | How are trajectory distributions, TD error, and GAE related? |
| E.2.5 | Bellman Expectations | What does the Bellman expectation equation mean probabilistically? |
| E.2.6 | Formulas and Exercises | How can common probability and stochastic-estimation formulas be checked? |
| E.3 | Calculus and Optimization | How do gradients, the chain rule, and optimizers drive policy updates? |
| E.3.1 | Derivatives and Gradients | How does a gradient tell the policy which direction to move? |
| E.3.2 | Policy Gradients | How does the policy gradient theorem follow from the objective? |
| E.3.3 | PPO and Adam | What calculus intuition appears in PPO objectives and Adam updates? |
| E.3.4 | Derivation Tools | Which transformations are easiest to get wrong in common derivations? |
| E.3.5 | Complete Formulas | How do advanced formulas help with reading algorithm papers? |
| E.3.6 | Formulas and Exercises | How can exercises reinforce gradient and optimization formulas? |
| E.4 | Information Theory and Distribution Distance | How do entropy, cross-entropy, and KL explain exploration and alignment constraints? |
| E.4.1 | Entropy and Exploration | How does entropy measure whether a policy is still exploring? |
| E.4.2 | Cross-Entropy and KL | Why can KL constrain old and new policy or model distributions? |
| E.4.3 | RLHF and DPO | What are the distribution-distance and reward interpretations in preference optimization? |
| E.4.4 | Mutual Information | How does mutual information describe shared information between variables? |
| E.4.5 | Complete Formulas | How do advanced information-theory formulas serve RL and alignment derivations? |
| E.4.6 | Formulas and Exercises | How can entropy, cross-entropy, and KL calculations be practiced? |
The code/ directory contains runnable examples aligned with course chapters. Each chapter's code is intentionally compact so it can be inspected, run, and modified independently.
| Area | Code Path | Representative Experiments |
|---|---|---|
| Classic control | `code/chapter01_cartpole/` | Train CartPole, inspect rewards and episode length, and compare PPO implementations. |
| Preference fine-tuning | `code/chapter02_dpo/` | Generate preference data, train with DPO, and compare model behavior before and after fine-tuning. |
| MDP and value learning | `code/chapter03_mdp/` | Run bandit strategies, solve GridWorld, and verify Bellman updates numerically. |
| Deep Q-learning | `code/chapter04_dqn/` | Implement replay buffers, target networks, and Double DQN variants. |
| Policy gradient | `code/chapter05_policy_gradient/` | Compare REINFORCE, baseline variants, and Actor-Critic updates. |
| PPO | `code/chapter07_ppo/` | Train LunarLander, inspect clipping, visualize GAE, and compare training stability. |
| RLHF | `code/chapter08_rlhf/` | Walk through SFT, reward model training, and PPO-style alignment. |
| Alignment and RLVR | `code/chapter09_alignment/`, `code/chapter09_grpo_rlvr/` | Explore DPO rewards, GRPO group advantages, and rule-based verifiable rewards. |
| VLM and agents | `code/chapter10_agentic_rl/`, `code/chapter11_vlm_rl/` | Build tool-use agent trajectory synthesis and implement multimodal model RL examples. |
| Advanced topics | `code/chapter12_future_trends/` | Study frontier directions including multi-agent RL and model-based RL. |
See code/README.md for a code index and chapter-specific dependency notes.
A practical path through the repository:
- Read the course guide and run the CartPole example.
- Skim the DPO chapter early, even before finishing all theory, to anchor the motivation for LLM post-training.
- Study Chapters 03-07 in order; this is the conceptual core.
- After understanding policy gradients and PPO, return to RLHF, DPO, GRPO, and RLVR.
- Use the debugging and engineering appendices whenever a training run behaves strangely.
- Treat frontier chapters as extensions: VLM reinforcement learning, Agentic RL, continuous control, multi-agent systems, and test-time reasoning.
Published course site:
https://walkinglabs.github.io/hands-on-modern-rl/
Requirements:
- Node.js >= 18.0.0
- npm
```bash
git clone https://github.com/walkinglabs/hands-on-modern-rl.git
cd hands-on-modern-rl
npm install
npm run dev
```

Then open the local VitePress URL shown in the terminal, usually `http://localhost:5173`.
Before submitting a pull request that changes documentation structure, theme code, navigation, build scripts, or generated assets, run:
```bash
npm run verify
```

This checks formatting, lints the VitePress theme, builds the site, and verifies expected build artifacts.
Most code examples use Python and are organized by chapter.
```bash
cd code
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

For smaller installs, use chapter-specific requirements files:

```bash
pip install -r chapter01_cartpole/requirements.txt
python chapter01_cartpole/1-ppo_cartpole.py
```

Some chapters may require additional system libraries, GPU support, model downloads, or environment-specific setup. Start with Chapter 01 before running examples that involve LLMs, VLMs, or heavy simulators.
```text
hands-on-modern-rl/
|-- docs/            # VitePress course content
|   |-- .vitepress/  # Site config, navigation, and theme overrides
|   |-- public/      # Static assets copied into the built site
|   |-- preface/     # Course introduction and history
|   |-- chapter*/    # Main course chapters
|   |-- appendix*/   # Supplementary material and references
|   `-- summaries/   # Part-level review and summary notes
|-- code/            # Runnable examples aligned with chapters
|-- scripts/         # Maintenance and verification scripts
|-- package.json     # Site scripts and dependencies
|-- AGENTS.md        # Repository maintenance guide
`-- README.md        # Main project overview
```
```bash
npm run dev           # Start the local documentation server
npm run build         # Build the static site
npm run preview       # Preview the built site locally
npm run format        # Format repository files with Prettier
npm run format:check  # Check formatting
npm run lint          # Lint VitePress theme code
npm run verify        # Run format check, lint, build, and artifact verification
```

Contributions should make the course clearer, more accurate, easier to reproduce, or easier to navigate.
Good contributions include:
- Fixing conceptual errors, formulas, diagrams, broken links, or typos.
- Improving explanations without changing the intended learning path.
- Adding small, reproducible experiments that clarify existing chapters.
- Improving scripts, build reliability, navigation, or accessibility.
- Adding high-quality references to papers, official documentation, or widely used open-source implementations.
Please keep pull requests focused. A good PR usually changes one chapter, one experiment, one group of diagrams, or one infrastructure issue at a time.
When adding content:
- Put course material under
docs/. - Use kebab-case for new directories and files.
- Prefer directory-based routes with
index.md. - Update
docs/.vitepress/config.mjswhen adding navigable pages. - Run
npm run verifybefore requesting review if your change touches config, theme, scripts, or generated site output. - Use Conventional Commits, such as
docs: clarify ppo clippingorfix: repair chapter link.
For repository-specific maintenance rules, see AGENTS.md.
Our team has also created other courses. Take a look:
For suggestions or feedback, scan the QR code to join the WeChat group (微信):
If you use this course in teaching materials, study notes, or derivative non-commercial educational work, please cite the repository:
```bibtex
@misc{hands_on_modern_rl,
  title        = {Hands-On Modern RL: Practice-first reinforcement learning from CartPole to LLM post-training and agentic systems},
  author       = {WalkingLabs},
  year         = {2026},
  howpublished = {\url{https://github.com/walkinglabs/hands-on-modern-rl}},
  note         = {Open courseware repository}
}
```

This course is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
You may share and adapt the material for non-commercial purposes, provided that you give appropriate credit and distribute derivative works under the same license.






