With significant advances in Vision-Language-Action (VLA)🍔 models based on large-scale imitation learning, integrating VLA with Reinforcement Learning (RL)🥤 has emerged as a promising paradigm. This paradigm leverages trial-and-error interaction with environments or pre-collected sub-optimal data.
This repository summarizes recent advances in the VLA🍔 + RL🥤 paradigm and classifies relevant works into offline RL training (without env.), online RL training (with env.), model-based RL (with a world model as env.), test-time RL (during deployment), and RL alignment.
Contributions are welcome! Please feel free to submit an issue or reach out via email to add papers!
If you find this repository useful, please give this list a star ⭐. Feel free to share it with others!
Offline RL pre-trains VLA models without environment interaction, leveraging both human demonstrations and autonomously collected data.
| Method | Title | Venue | Date | Code/Project | Key feature/finding |
|---|---|---|---|---|---|
| Q-Transformer | Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions | Arxiv | 18/9/2023 | Github | Offline Q-learning with Transformer models: 1. autoregressive discrete Q-learning; 2. conservative Q-learning; 3. Monte Carlo and n-step returns |
| Perceiver-Actor-Critic | Offline Actor-Critic Reinforcement Learning Scales to Large Models | ICML2024 | 08/2/2024 | Project | An offline actor-critic method that scales to models of up to 1B parameters and learns a wide variety of 132 control and robotics tasks |
| GeRM | GeRM: A Generalist Robotic Model with Mixture-of-experts for Quadruped Robot | IROS2024 | 20/3/2024 | Github | Mixture-of-Experts structure; quadruped robot learning |
| ReinboT | ReinboT: Amplifying Robot Visual-Language Manipulation with Reinforcement Learning | ICML2025 | 12/5/2025 | | Max-return sequence modeling (as in Reinformer); reward densification with heuristic methods |
| MoRE | MoRE: Unlocking Scalability in Reinforcement Learning for Quadruped Vision-Language-Action Models | ICRA2025 | 11/3/2025 | | Integrates multiple low-rank adaptation modules as distinct experts within a dense multi-modal large language model (MLLM), forming a sparse-activated mixture-of-experts model |
| CO-RFT | CO-RFT: Efficient Fine-Tuning of Vision-Language-Action Models through Chunked Offline Reinforcement Learning | Arxiv | 04/8/2025 | | Chunk-level offline RL fine-tuning; proposes Chunked RL via n-step TD learning (see the sketch below this table) |
| ARFM | Balancing Signal and Variance: Adaptive Offline RL Post-Training for VLA Flow Models | Arxiv | 04/9/2025 | | Introduces an adaptively adjusted scaling factor in the VLA flow-model loss to obtain a principled bias-variance trade-off objective that controls the impact of the RL signal on the flow loss; ARFM adaptively balances RL advantage preservation against flow-loss gradient variance, yielding more stable and efficient fine-tuning |
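As a concrete illustration of the chunk-level TD learning described in the CO-RFT row above, here is a minimal sketch of an n-step TD target computed over a whole action chunk rather than a single step (variable names, shapes, and chunk length are illustrative assumptions, not CO-RFT's implementation):

```python
import torch

def chunked_n_step_td_target(rewards, next_value, gamma=0.99):
    """Illustrative n-step TD target over one action chunk.

    rewards:    (B, n) per-step rewards collected while executing an
                n-step action chunk.
    next_value: (B,) critic estimate of the state reached after the chunk.
    Returns G = sum_{k=0}^{n-1} gamma^k * r_k  +  gamma^n * V(s_{t+n}).
    """
    _, n = rewards.shape
    discounts = gamma ** torch.arange(n, dtype=rewards.dtype)
    chunk_return = (rewards * discounts).sum(dim=1)
    return chunk_return + (gamma ** n) * next_value

# toy usage: a sparse reward arriving at the end of the first chunk
rewards = torch.zeros(2, 8)
rewards[0, -1] = 1.0
next_value = torch.tensor([0.0, 0.4])
target = chunked_n_step_td_target(rewards, next_value)  # regression target for Q(s_t, chunk)
```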
Through trial-and-error interaction with online environments, VLA models can be further improved.
| Method | Title | Venue | Date | Code/Project | Key feature/finding |
|---|---|---|---|---|---|
| FLaRe | FLaRe: Achieving Masterful and Adaptive Robot Policies with Large-Scale Reinforcement Learning Fine-Tuning | ICRA 2025 Best Paper Finalist | 30/9/2024 | Code | For large-scale fine-tuning in simulation, it performs extensive domain randomization, extracts visual features with DINOv2, uses the KV-cache technique during inference, and adopts a set of algorithmic choices to keep RL fine-tuning stable |
| PA-RL | Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone | Arxiv | 9/12/2024 | Project | A single method that fine-tunes multiple policy classes with varying architectures and sizes; enables sample-efficient improvement of diffusion and transformer-based autoregressive policies; sets a new state of the art for offline-to-online RL and makes it possible, for the first time, to improve OpenVLA |
| iRe-VLA | Improving Vision-Language-Action Model with Online Reinforcement Learning | RAL2025 | 28/1/2025 | | Adopts two-stage iterative SFT & RL optimization to stabilize the training process and manage the model-training burden |
| RIPT-VLA | Interactive Post-Training for Vision-Language-Action Models | Arxiv | 22/5/2025 | Github | A critic-free optimization framework called Leave-One-Out Proximal Policy Optimization (LOOP); dynamic rollout sampling (see the leave-one-out advantage sketch below this table) |
| VLA-RL | VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning | Arxiv | 24/5/2025 | Github | Robotic process reward model and the VLA-RL system with (1) a curriculum selection strategy, (2) critic warm-up, (3) GPU-balanced vectorized environments, and (4) PPO infrastructure |
| RLVLA | What Can RL Bring to VLA Generalization? An Empirical Study | NeurIPS 2025 | 26/5/2025 | Github | PPO consistently outperforms GRPO and DPO; shared actor-critic backbone; VLA warm-up |
| RFTF | RFTF: Reinforcement Fine-tuning for Embodied Agents with Temporal Feedback | Arxiv | 26/5/2025 | | To address the sparse-reward problem, RFTF leverages a value model trained with temporal information to generate dense rewards |
| SimpleVLA-RL | SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning | Arxiv | 12/9/2025 | Github | |
| TGRPO | TGRPO: Fine-tuning Vision-Language-Action Model via Trajectory-wise Group Relative Policy Optimization | Arxiv | 10/6/2025 | Github | Extends GRPO from LLMs to trajectory-wise GRPO (TGRPO) for VLAs |
| OctoNav | OctoNav: Towards Generalist Embodied Navigation | Arxiv | 11/6/2025 | Project | For navigation tasks, it proposes a VLA+RL hybrid training paradigm with SFT, Nav-GRPO, and online RL stages; the VLA model also gains thinking-before-action ability |
| RLRC | RLRC: Reinforcement Learning-based Recovery for Compressed Vision-Language-Action Models | Arxiv | 21/6/2025 | Project | An RL-based VLA compression paradigm: a three-stage pipeline of structured pruning, SFT- and RL-based performance recovery, and 4-bit quantization significantly reduces model size and boosts inference speed while preserving, and in some cases surpassing, the original model's ability to execute robotic tasks |
| RLinf | RLinf: Reinforcement Learning Infrastructure for Agentic AI | Arxiv | 8/2025 | Project | A flexible and scalable open-source infrastructure for post-training foundation models via reinforcement learning; the 'inf' stands for Infrastructure, highlighting its role as a robust backbone for next-generation training, and for Infinite, symbolizing support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development |
| RLinf-VLA | RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training | Arxiv | 10/2025 | Project | |
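Several entries above (e.g. RIPT-VLA's LOOP and the GRPO-style methods) replace a learned critic with a group-based baseline. A minimal sketch of the shared leave-one-out advantage idea, assuming K rollouts per task with scalar returns (illustrative code, not any paper's implementation):

```python
import torch

def leave_one_out_advantages(returns):
    """Critic-free, leave-one-out advantage estimate.

    returns: (K,) episode returns of K rollouts sampled for the same task.
    Each rollout's baseline is the mean return of the other K-1 rollouts,
    so no learned value function is needed.
    """
    K = returns.shape[0]
    baselines = (returns.sum() - returns) / (K - 1)  # mean of the other rollouts
    return returns - baselines

# toy usage: 4 rollouts of the same instruction with binary success rewards
returns = torch.tensor([1.0, 0.0, 1.0, 1.0])
adv = leave_one_out_advantages(returns)  # positive for successes, negative for the failure
# `adv` would then weight a policy-gradient / PPO-style update on each rollout's actions.
```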
| Method | Title | Venue | Date | Code/Project | Key feature/finding |
|---|---|---|---|---|---|
| RLDG | RLDG: Robotic Generalist Policy Distillation via Reinforcement Learning | RSS2025 | 12/2024 | Project | Pre-trains task-specific RL policies with HIL-SERL; distills the RL policies into a VLA for knowledge transfer |
| PA-RL | Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone | Arxiv | 9/12/2024 | Project | A single method that fine-tunes multiple policy classes with varying architectures and sizes; enables sample-efficient improvement of diffusion and transformer-based autoregressive policies; sets a new state of the art for offline-to-online RL and makes it possible, for the first time, to improve OpenVLA |
| iRe-VLA | Improving Vision-Language-Action Model with Online Reinforcement Learning | RAL2025 | 28/1/2025 | | Adopts two-stage iterative SFT & RL optimization to stabilize the training process and manage the model-training burden |
| ConRFT | ConRFT: A Reinforced Fine-tuning Method for VLA Models via Consistency Policy | RSS2025 | 14/4/2025 | Github | Offline fine-tuning (Cal-QL) followed by online fine-tuning (CPQL + HIL-SERL) |
| VLAC | VLAC: A Vision-Language-Action-Critic Model for Robotic Real-World Reinforcement Learning | Github | 16/9/2025 | Github | A general-purpose pair-wise critic and manipulation model designed for real-world robot reinforcement learning and data refinement |
| Generalist | Self-Improving Embodied Foundation Models | NeurIPS 2025 | 18/9/2025 | | A two-stage paradigm: the first stage, supervised fine-tuning (SFT), fine-tunes pretrained foundation models with both (a) behavioral cloning and (b) steps-to-go prediction objectives; in the second stage, self-improvement, steps-to-go prediction yields a well-shaped reward function and a robust success detector, enabling a fleet of robots to autonomously practice downstream tasks with minimal human supervision (see the sketch below this table) |
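The steps-to-go objective in the Self-Improving Embodied Foundation Models row can be turned into both a shaped reward and a success detector. A small sketch of that idea (the function, threshold, and toy predictions are illustrative assumptions, not the paper's code):

```python
import numpy as np

def steps_to_go_reward(steps_to_go_pred, success_threshold=1.0):
    """Shaped reward and success detector from a steps-to-go predictor.

    steps_to_go_pred: (T,) predicted remaining steps d(s_t) along a rollout.
    Reward at step t is the predicted progress d(s_t) - d(s_{t+1}),
    and success is declared when the prediction drops below a threshold.
    """
    d = np.asarray(steps_to_go_pred, dtype=float)
    rewards = d[:-1] - d[1:]                   # positive when the policy makes progress
    success = bool(d[-1] < success_threshold)  # crude success detector
    return rewards, success

# toy usage with made-up predictions from a steps-to-go head
pred = [12.0, 10.5, 9.0, 4.0, 0.5]
rewards, success = steps_to_go_reward(pred)
```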
| Method | Title | Venue | Date | Code/Project | Key feature/finding |
|---|---|---|---|---|---|
| World-Env | World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training | Arxiv | 30/9/2025 | | A world-model-based framework that enables low-cost, safe RL post-training of VLA policies under extreme data scarcity, eliminating the need for real-world interaction |
| VLA-RFT | VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators | Arxiv | 01/10/2025 | Github | A reinforcement fine-tuning framework that leverages a data-driven world model as a controllable simulator (see the interface sketch below this table) |
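Both rows share the same pattern: a learned world model stands in for the environment during RL post-training. A minimal sketch of what such a gym-like wrapper could look like (the `world_model`/`reward_model` components, method names, and horizon are all hypothetical assumptions, not either paper's interface):

```python
# Sketch: a learned world model exposed through a gym-like interface so that
# any on-policy RL loop can roll out a VLA policy without a real robot.

class WorldModelEnv:
    def __init__(self, world_model, reward_model, horizon=50):
        self.wm = world_model   # hypothetical: predicts next observation from (obs, action)
        self.rm = reward_model  # hypothetical: scores (obs, action, next_obs), e.g. a verified reward
        self.horizon = horizon
        self.t = 0
        self.obs = None

    def reset(self, initial_obs):
        self.t = 0
        self.obs = initial_obs
        return self.obs

    def step(self, action):
        next_obs = self.wm.predict(self.obs, action)        # imagined transition
        reward = self.rm.score(self.obs, action, next_obs)  # dense, model-based reward
        self.t += 1
        done = self.t >= self.horizon
        self.obs = next_obs
        return next_obs, reward, done, {}
```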
These methods leverage a value function pre-trained via offline RL to guide the policy at deployment time.
| Method | Title | Venue | Date | Code/Project | Key feature/finding |
|---|---|---|---|---|---|
| Bellman-Guided Retrials | To Err is Robotic: Rapid Value-Based Trial-and-Error during Deployment | Arxiv | 22/6/2024 | Github | Pre-trains a value function to estimate task completion; on detected failure, recovers the robot and samples a new strategy |
| V-GPS | Steering Your Generalists: Improving Robotic Foundation Models via Value Guidance | CoRL2024 | 17/10/2024 | Project | Re-ranks multiple action proposals from a generalist policy using a value function at test time (see the best-of-N sketch below this table) |
| Hume | Hume: Introducing System-2 Thinking in Visual-Language-Action Model | Arxiv | 2/6/2025 | Github | Pre-trains a value function and performs best-of-N selection over candidate action chunks using state-action value estimates |
| VLA-Reasoner | VLA-Reasoner: Empowering Vision-Language-Action Models with Reasoning via Online Monte Carlo Tree Search | Arxiv | 26/9/2025 | | A plug-in framework that empowers VLAs with test-time Monte Carlo tree search (MCTS) to correct incremental deviations during deployment |
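The value-guided re-ranking in V-GPS and the best-of-N chunk selection in Hume share a simple structure: sample several candidate actions from the generalist policy and execute the one a learned Q-function scores highest. A toy sketch under those assumptions (the stand-in policy and critic below are random placeholders, not any paper's models):

```python
import torch

def value_guided_action_selection(policy_sample, q_value, obs, num_candidates=8):
    """Best-of-N action selection with a learned Q-function.

    policy_sample(obs)   -> one candidate action (or action chunk)
    q_value(obs, action) -> scalar state-action value estimate
    Samples N candidates from the generalist policy and returns the one
    the value function ranks highest.
    """
    candidates = [policy_sample(obs) for _ in range(num_candidates)]
    scores = torch.stack([q_value(obs, a) for a in candidates])
    return candidates[int(scores.argmax())]

# toy usage with hypothetical stand-ins for the policy and critic
obs = torch.randn(16)
policy_sample = lambda o: torch.randn(7)   # random 7-DoF action proposal
q_value = lambda o, a: -(a ** 2).sum()     # toy critic that prefers small actions
best_action = value_guided_action_selection(policy_sample, q_value, obs)
```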
| Method | Title | Venue | Date | Code/Project | Key feature/finding |
|---|---|---|---|---|---|
| GRAPE | GRAPE: Generalizing Robot Policy via Preference Alignment | ICLR2025 workshop | 4/2/2025 | Github | Trajectory-wise preference optimization aligns VLA policies at the trajectory level (see the sketch below this table) |
| SafeVLA | SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning | NeurIPS 2025 | 31/5/2025 | Project | Constrains VLA policies via safe reinforcement learning |
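Trajectory-level preference alignment, as in the GRAPE row, generally follows the shape of a DPO loss computed over whole trajectories. A minimal sketch of such an objective (this is the generic DPO-style form, not GRAPE's exact objective; `beta` and the toy log-probabilities are illustrative):

```python
import torch
import torch.nn.functional as F

def trajectory_preference_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-style preference loss applied at the trajectory level.

    logp_w / logp_l:         (B,) summed log-probs of all actions in the
                             preferred / dispreferred trajectory under the policy.
    ref_logp_w / ref_logp_l: the same quantities under a frozen reference policy.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

# toy usage with made-up trajectory log-probabilities
logp_w, ref_logp_w = torch.tensor([-120.0]), torch.tensor([-125.0])
logp_l, ref_logp_l = torch.tensor([-118.0]), torch.tensor([-110.0])
loss = trajectory_preference_loss(logp_w, logp_l, ref_logp_w, ref_logp_l)
```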
| Method | Title | Venue | Date | Code/Project | Key feature/finding |
|---|---|---|---|---|---|
| RPD | Refined Policy Distillation: From VLA Generalists to RL Experts | Arxiv | 6/3/2025 | | Leverages a VLA model as a policy prior to improve the sample efficiency of RL, in the spirit of Jump-Start RL (see the sketch below) |
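The Jump-Start-style use of a VLA as a policy prior, mentioned in the RPD row, can be sketched as follows (hypothetical env and policy objects, not RPD's implementation): the frozen VLA acts for the first part of each episode and the learning RL policy takes over afterwards, with the handover point gradually annealed as the RL policy improves.

```python
def jump_start_rollout(env, vla_policy, rl_policy, guide_steps, max_steps=200):
    """Collect one episode where a frozen VLA 'guide' acts for the first
    `guide_steps` steps and the learning RL policy acts afterwards.
    Typically only the RL-policy portion is used for policy-gradient updates.
    """
    obs = env.reset()
    transitions = []
    for t in range(max_steps):
        actor = "vla" if t < guide_steps else "rl"
        action = vla_policy(obs) if actor == "vla" else rl_policy(obs)
        next_obs, reward, done, info = env.step(action)
        transitions.append((obs, action, reward, next_obs, done, actor))
        obs = next_obs
        if done:
            break
    return transitions

# A curriculum would gradually shrink `guide_steps` (e.g. once the RL policy's
# success rate matches the guide's), so the RL policy eventually controls the
# whole episode.
```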