
Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs

[ICML 2026] · Paper: https://arxiv.org/abs/2602.17616

Overview

We introduce Variance-Controlled Off-Policy Optimization (VCPO), a framework that adds explicit variance-targeted controls to policy-gradient methods in the off-policy setting, enabling stable and scalable Async RL training.

  • ✨ Seamlessly integrates into common policy-gradient methods like REINFORCE/RLOO/GRPO
  • 🚀 2.5x faster Async RL training while matching synchronous RL performance
  • 🧠 Robust training stability in highly off-policy settings (at least k=128 steps off-policy)

Async RL overlaps rollout generation with learning, significantly reducing end-to-end training time. But large speedups typically require high policy lag, which can cause collapse.

Why? Highly stale rollouts make importance-sampling (IS) ratios heavy-tailed, so a few trajectories dominate each update and the policy-gradient estimator becomes high-variance (see the toy illustration below). Previous work has tried masking/clipping/whitening IS ratios, algorithmic changes, and system-side changes. These can delay collapse… but still fail at high asynchrony.
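
As a toy illustration of this failure mode (ours, not from the paper), model the log importance weights as Gaussian with variance growing in the policy gap, a standard log-normal model of IS weights, and watch the batch's effective sample size (ESS, defined formally below) collapse:

```python
# Toy illustration (not from the paper): as the behavior policy drifts
# from the learner, importance weights become heavy-tailed and the
# effective sample size (ESS) of a batch collapses.
import numpy as np

rng = np.random.default_rng(0)
B = 4096                                  # batch of trajectories
for gap in [0.0, 0.25, 1.0, 4.0]:         # policy divergence (nats)
    # Model log w ~ N(-gap, 2*gap) so that E[w] = 1 (unbiased IS).
    log_w = rng.normal(-gap, np.sqrt(2.0 * gap), size=B)
    w = np.exp(log_w)
    ess = w.sum() ** 2 / (w ** 2).sum()
    print(f"gap={gap:4.2f}  ESS/B={ess / B:.4f}")
```

Under this model, ESS/B decays roughly as exp(-2·gap): at a gap of 4 nats, the 4096-trajectory batch carries only a few effective samples, so any single update is essentially noise.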

To address this, VCPO introduces two techniques to stabilize policy-gradient methods for asynchronous RL training:

  1. ESS-guided step scaling to dampen unreliable updates, following the square-root scaling rule for AdamW-style optimizers.

$$ \eta_{\text{eff}} \propto \sqrt{\rho_{\text{ess}}}, \qquad \rho_{\text{ess}} \triangleq \frac{\mathrm{ESS}}{B} \triangleq \frac{1}{B}\frac{\left(\sum_{i=1}^{B} w_i\right)^2}{\sum_{i=1}^{B} w_i^2} $$

  2. Closed-form off-policy optimal baseline (OPOB) using gradient norms and importance ratios (no learned critic), implemented with minimal overhead and compatible with DP×TP×SP parallelism; a sketch of both controls follows this list:

$$ b_{\text{OPOB}}^\star=\frac{\sum_{i=1}^N w_i^2 \|\nabla_\theta \log \pi_\theta(\tau_i)\|^2 R_i}{\sum_{i=1}^N w_i^2 \|\nabla_\theta \log \pi_\theta(\tau_i)\|^2} $$
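
As a concrete reference, here is a minimal PyTorch sketch of both controls. This is illustrative only, not the repository's implementation (see megatron_actor.py and vcpo.py for the real, parallelism-aware version); it assumes you already have per-trajectory log importance weights, squared gradient-norm estimates, and rewards.

```python
# Minimal sketch of VCPO's two controls, assuming per-trajectory tensors
# of shape [B]: log importance weights log_w = log pi_theta - log pi_behavior,
# squared gradient norms g2 ~= ||grad_theta log pi_theta(tau_i)||^2,
# and scalar rewards R_i. Illustrative, not the repo's code.
import torch

def ess_ratio(log_w: torch.Tensor) -> torch.Tensor:
    """rho_ess = ESS / B = (sum_i w_i)^2 / (B * sum_i w_i^2)."""
    w = (log_w - log_w.max()).exp()  # shift for numerical stability;
    # the ratio below is scale-invariant, so the shift cancels out.
    return w.sum().pow(2) / (w.pow(2).sum() * w.numel())

def ess_scaled_lr(base_lr: float, log_w: torch.Tensor) -> float:
    """ESS-guided step scaling: eta_eff proportional to sqrt(rho_ess),
    the square-root rule for AdamW-style optimizers."""
    return base_lr * ess_ratio(log_w).sqrt().item()

def opob_baseline(log_w: torch.Tensor, g2: torch.Tensor,
                  rewards: torch.Tensor) -> torch.Tensor:
    """Closed-form off-policy optimal baseline: rewards averaged with
    weights w_i^2 * ||grad log pi(tau_i)||^2 (no learned critic)."""
    c = (2 * (log_w - log_w.max())).exp() * g2  # stabilized w_i^2 * g2_i
    return (c * rewards).sum() / c.sum()

# Hypothetical usage in a REINFORCE-style update:
#   b = opob_baseline(log_w, g2, rewards)
#   loss = -(log_w.exp().detach() * (rewards - b) * logp_sum).mean()
#   for group in optimizer.param_groups:
#       group["lr"] = ess_scaled_lr(base_lr, log_w)
```

Both controls only need batch-level scalar statistics, which can be aggregated with a few all-reduces; this is presumably what keeps the overhead minimal across DP×TP×SP.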

Results

We use k to denote the maximum sampler–learner policy lag (i.e., k steps off-policy), following the PipelineRL setting. Across math, general reasoning, and tool-use tasks with model sizes from 1.5B to 7B, VCPO enables stable asynchronous training where prior stabilizers fail. In long-context multi-turn RL, VCPO delivers a 2.5× end-to-end speedup while matching synchronous performance.

End-to-end training time vs. validation accuracy for synchronous (k=0) and asynchronous (lag k) training. Here, Steps denotes gradient update steps, and GPU hours ↓ measures total wall-clock time across sampling and training GPUs.

Countdown

| Method | Countdown Acc ↑ | Steps | GPU hours ↓ |
| --- | --- | --- | --- |
| Base | 1.6% | -- | -- |
| Sync (k=0) | 38.4% | 400 | 143.2 |
| VCPO + Async (k=10) | 41.9% | 400 | 89.6 |

MATH-500

| Method | MATH-500 Acc ↑ | Steps | GPU hours ↓ |
| --- | --- | --- | --- |
| Base | 40.2% | -- | -- |
| Sync (k=0) | 72.0% | 400 | 134.4 |
| VCPO + Async (k=10) | 71.6% | 400 | 92.8 |

AIME 2025

| Method | AIME 2025 Acc ↑ | Steps | GPU hours ↓ |
| --- | --- | --- | --- |
| Base | 5.3% | -- | -- |
| Sync (k=0) | 26.7% | 300 | 420.2 |
| VCPO + Async (k=2) | 27.8% | 220 | 168.9 |

Async RL already achieves its full speedup at fewer than 10 steps off-policy, but we stress-tested far beyond that and found that VCPO remains stable up to at least 128 steps off-policy.

Getting Started

VCPO is implemented for the Megatron backend, with core logic in megatron_actor.py, vcpo.py, and staleness_utils.py. Training scripts are under recipe/fully_async_policy/shell/vcpo/.

1. Install — follow the veRL documentation to set up the environment. Specifically, we use Megatron-Core 0.13.1 with vLLM 0.11.0 following the conda installation instructions.

2. Prepare data

hf download lukhuang/vcpo --repo-type dataset --local-dir data

3. Train

Edit the model and data paths in the script, then launch:
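
For example (illustrative only; the variable names below are hypothetical, so edit whatever the script you launch actually defines):

```bash
# Inside the script you plan to run, e.g.
# recipe/fully_async_policy/shell/vcpo/math/vcpo_k=10.sh,
# point the model/data variables (names here are hypothetical) at your setup:
MODEL_PATH=/path/to/Qwen2.5-7B   # local model checkpoint
DATA_DIR=./data                  # from the `hf download` step above

# then launch it:
bash recipe/fully_async_policy/shell/vcpo/math/vcpo_k=10.sh
```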

GSM8K and MATH-500 Experiments

GSM8K experiments use the Qwen2-1.5B model with the official train-test split.

# Synchronous (k=0)
bash recipe/fully_async_policy/shell/vcpo/gsm8k/synchronous.sh

# Fully asynchronous VCPO (k=12)
bash recipe/fully_async_policy/shell/vcpo/gsm8k/vcpo_k=12.sh

MATH experiments use the Qwen2.5-7B model with the official train-test split.

# Synchronous
bash recipe/fully_async_policy/shell/vcpo/math/synchronous.sh

# Fully asynchronous training + VCPO
bash recipe/fully_async_policy/shell/vcpo/math/vcpo_k=10.sh

# Highly off-policy asynchronous training + VCPO
bash recipe/fully_async_policy/shell/vcpo/math/vcpo_k=16.sh  # k=16 steps off-policy
bash recipe/fully_async_policy/shell/vcpo/math/vcpo_k=32.sh  # k=32 steps off-policy
bash recipe/fully_async_policy/shell/vcpo/math/vcpo_k=64.sh  # k=64 steps off-policy
bash recipe/fully_async_policy/shell/vcpo/math/vcpo_k=128.sh # k=128 steps off-policy

Long-Horizon Tool-Use Experiments

We evaluate long-horizon tool use in the SimpleTIR setting, where the model must interleave reasoning with external tool calls. We train using the DAPO dataset and evaluate on a held-out exam-style benchmark (AIME 2025).

# Synchronous
bash recipe/fully_async_policy/shell/vcpo/multiturn/synchronous.sh

# Fully asynchronous VCPO
bash recipe/fully_async_policy/shell/vcpo/multiturn/vcpo_k=2.sh

Citation

If you find this work useful, please consider citing:

@article{huang2026stable,
  title        = {Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs},
  author       = {Luke J. Huang and Zhuoyang Zhang and Qinghao Hu and Shang Yang and Song Han},
  year         = {2026},
  month        = feb,
  eprint       = {2602.17616},
  archivePrefix= {arXiv},
  primaryClass = {cs.LG},
  url          = {https://arxiv.org/abs/2602.17616}
}

License and Attribution

This repository was implemented on top of veRL at commit 15a9b0f58a8be2445417493ae7911439c9700cf2.

It is licensed under the Apache License, Version 2.0. See LICENSE for details.
