The codebase for the paper: A Practitioner's Guide to Multi-turn Agentic Reinforcement Learning
📖 Overview
🚀 Quick Start
🔔 News
📈 Recipes
🔑 Key Findings
🧋 Build Your Own Agentic Pipelines!
📚 Citation
Welcome to the Meow-Tea Café! ☕️
Just as "Multi" sounds like "Meow-tea" 🐈 🍵, our RL framework brings together the best ingredients for cooking up powerful Multi-turn Agentic RL solutions.
At Meow-Tea Café, we serve a diverse menu of agentic dishes (tasks) ranging from text-based adventures to real-world software engineering challenges. Each dish in our meow_tea_train/agentic_menu/ represents a different agentic task:
- 🎮 TextWorld: Text-based adventure game environments
- 🏠 ALFWorld: Situated household tasks
- 💻 SWE-Gym: Real-world software engineering problems
- ...and more specialty dishes coming soon!
For each dish on our menu, we identify three essential RL cooking processes that can bring out its best flavors. We've made these components fully configurable in our framework:
- 🌎 Environments - The foundation of your dish (the agentic task itself, sync vs async rollout, tool use, thinking abilities)
- 🤖 Policies - Your RL cooking technique (PPO, GRPO, RLOO, and more)
- ⭐ Rewards - The perfect heat control and timing (single vs dense rewards, verified vs learned rewards)
These three pillars of RL cooking are the heart of our framework, letting RL cooking lovers - whether you're a researcher, practitioner, or student - experiment with new ways to frame and solve agentic tasks (see the configuration sketch below).
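To give a concrete sense of how these three ingredients are mixed, here is a minimal, hypothetical sketch of a training launch with command-line overrides. The module path and option names (env.name, algorithm.adv_estimator, reward.style, ...) are illustrative assumptions in the style of hydra-configured trainers, not the exact schema of this codebase - see the recipes and documentation for the real configuration keys.

# Hypothetical sketch - option names are assumptions, not the real config schema:
#   env.*       -> 🌎 Environment: the agentic task, sync vs async rollout, tool use
#   algorithm.* -> 🤖 Policy: PPO, GRPO, RLOO, ...
#   reward.*    -> ⭐ Reward: single terminal reward vs dense per-turn rewards
python -m meow_tea_train.trainer \
    env.name=textworld \
    env.rollout.mode=async \
    algorithm.adv_estimator=grpo \
    reward.style=dense \
    trainer.total_epochs=10

In practice, the tested scripts under recipes/ and examples/ encapsulate these choices for each task.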
Can't find your favorite dish on our menu? No problem! The Meow-Tea Café includes a special build_your_own section, where we'll walk you through creating the agentic task (dish) you want to cook.
We provide tested recipes for different agentic tasks. These recipes are training configurations that have been validated through our experiments. You can find them under recipes/ and examples/{agentic_task}.
Our paper presents systematic findings on what works and what doesn't for multi-turn agentic RL. Key research questions we address include:
- Can we train agents on simpler environments and expect them to perform well on complex ones?
- How do different RL algorithms impact multi-turn RL training?
- Is there an optimal ratio of SFT:RL data given a fixed budget?
- How does the density of rewards impact multi-turn RL training?
- ...and more in our paper
Start by creating and running a Docker container with GPU support:
docker create --runtime=nvidia --gpus all --net=host --shm-size="10g" --cap-add=SYS_ADMIN -v .:/workspace --name meow-tea-taro hiyouga/verl:ngc-th2.8.0-cu12.9-vllm0.11.0 sleep infinity
docker start meow-tea-taro
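# Optionally confirm the container can see your GPUs (this assumes nvidia-smi
# is available in the image, as it is in standard NGC CUDA images)
docker exec meow-tea-taro nvidia-smi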
docker exec -it meow-tea-taro bash
Clone the latest meow-tea-taro repository and install:
git clone [email protected]:pearls-lab/meow-tea-taro.git
cd meow-tea-taro
pip install -e .
That's it!
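As a quick sanity check, you can try importing the package; the module name below is assumed from the meow_tea_train/ directory mentioned above.
# Module name assumed from the meow_tea_train/ directory; adjust if it differs
python -c "import meow_tea_train; print('meow-tea-taro installed')"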
Run a quick example of multi-turn PPO on TextWorld tasks using Qwen2.5-0.5B-Instruct:
sh examples/textworld/run_ppo_qwen-0.5b.sh
You should be able to see a training curve like this: wandb log.
Now you are ready to cook your RL dishes! Refer to the meow-tea-taro documentation for detailed environment, policy and reward configuration tutorials.
The datasets used in our meow_tea_experiments are available in 🤗 Huggingface: PEARLS-Lab/meow-tea-taro-dataset.
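If you would rather fetch the data locally before training, a standard Hugging Face CLI download works; the destination directory below is just an example.
# Requires the huggingface_hub CLI; the target directory is an example
huggingface-cli download PEARLS-Lab/meow-tea-taro-dataset --repo-type dataset --local-dir data/meow-tea-taro-dataset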
- 🎉 [10/21/2025] Meow-Tea-Taro codebase is now open-source! Recipes, datasets, and model checkpoints are available.
- 🎉 [10/01/2025] Paper "A Practitioner's Guide to Multi-turn Agentic Reinforcement Learning" released
We share the recipes for TextWorld, ALFWorld, and SWE-Gym tasks here. The table below summarizes the agentic tasks, our configurations, and the performance of our recipes.
| Agentic Task | Base Model | Policy | Reward | Success Rate | Performance Delta | Script |
|---|---|---|---|---|---|---|
| TextWorld (w2-o3-q4) | Qwen2.5-1.5B-Instruct | PPO | Single | 97% | ⬆️ 82% | recipe |
| TextWorld (w4-o6-q8) | Qwen2.5-1.5B-Instruct | PPO | Single | 94% | ⬆️ 93% | recipe |
| TextWorld (cooking) | Qwen2.5-7B-Instruct | PPO | Dense | 58% | ⬆️ 29% | recipe |
| TextWorld (cooking) | Qwen2.5-7B-Instruct | RLOO | Dense | 55% | ⬆️ 26% | recipe |
| ALFWorld (text-based) | Qwen2.5-7B-Instruct | PPO | Single | 74% | ⬆️ 73% | recipe (sft), recipe (ppo) |
| SWE-Gym | Qwen3-8B | GRPO | Single | 22% | ⬆️ 18% | recipe |
Check out the paper for key findings and takeaways: A Practitioner's Guide to Multi-turn Agentic Reinforcement Learning.
We are committed to expanding the codebase (adding more agentic tasks, more RL algorithms, and different reward modeling techniques), and we will share research insights from our ongoing experiments on multi-turn agentic RL along the way. Stay tuned!
Tutorials are under development and will be released soon!
@misc{wang2025practitionersguidemultiturnagentic,
title={A Practitioner's Guide to Multi-turn Agentic Reinforcement Learning},
author={Ruiyi Wang and Prithviraj Ammanabrolu},
year={2025},
eprint={2510.01132},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2510.01132},
}
Join our community of RL chefs! 👨‍🍳👩‍🍳 At Meow-Tea Café, we're passionate about promoting open-source RL recipes and models. We welcome contributions, new recipes, and fresh ideas to make Meow-Tea Café even better.
Stay tuned as we keep expanding our menu and refining our recipes!