Important
🔥 News!!!
- [2025/03] We release the training record of an early version of DAPO (w/o Token-level PG Loss & Dynamic Sampling) in wandb, achieving 44% on AIME 2024.
We release a fully open-sourced system for large-scale LLM RL, including algorithm, code infrastructure, and dataset. The system achieves state-of-the-art large-scale LLM RL performance. We propose the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm. Through open-sourcing, we provide the broader research community and society with practical access to scalable reinforcement learning, enabling all to benefit from these advancements. Our system is based on the awesome verl framework. Thanks for their great work!
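As a concrete illustration of the two components named in DAPO, below is a minimal PyTorch-style sketch of a token-level policy-gradient loss with decoupled clip ranges. This is not the released implementation: the function name, tensor shapes, and clip values are illustrative assumptions, and dynamic sampling (filtering out prompt groups whose sampled answers are all correct or all wrong) would happen upstream of this loss. Please refer to the paper and code for the exact formulation.

```python
import torch

def decoupled_clip_token_loss(log_probs, old_log_probs, advantages, response_mask,
                              clip_low=0.2, clip_high=0.28):
    """Illustrative token-level PG loss with decoupled clip ranges (not the released code).

    log_probs, old_log_probs: (batch, seq_len) token log-probs under the current
        and behavior policies.
    advantages: (batch, seq_len) group-normalized advantages broadcast per token.
    response_mask: (batch, seq_len) 1 for generated response tokens, 0 elsewhere.
    clip_low / clip_high: illustrative values; a larger upper range ("clip-higher")
        leaves more room for low-probability tokens to be up-weighted.
    """
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high)
    per_token = -torch.min(ratio * advantages, clipped * advantages)
    # Token-level aggregation: average over all response tokens in the batch,
    # so longer responses contribute proportionally more tokens to the loss.
    return (per_token * response_mask).sum() / response_mask.sum().clamp(min=1)
```

The key difference from a sample-level (GRPO-style) loss is the denominator: tokens are pooled across the whole batch instead of being averaged per response first.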
🤗 If you have any questions about our paper, feel free to open an issue and we can discuss it there. Thank you!
🚀 DAPO achieves 50 points on AIME 2024 with the Qwen2.5-32B base model, outperforming the previous SoTA DeepSeek-R1-Zero-Qwen-32B while using only 50% of the training steps.
- Length stability and growth: The steady increase in response length allows for greater exploration, facilitating the model's ability to learn more complex reasoning behaviors, ultimately contributing to training stability and performance improvement.
- Reward score stability: A stable increase in the reward signal indicates that the model is successfully fitting the training distribution, ensuring that the learning process remains robust and consistent without significant fluctuations.
- Entropy and mean probability trend: A controlled increase in entropy, after an initial decrease, ensures a healthy balance between exploration and exploitation, avoiding issues such as overfitting or excessive randomness, and promoting sustained model performance. (See the monitoring sketch after this list.)
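A minimal sketch of how these run-level signals could be tracked from rollout tensors. The helper below is hypothetical (it is not part of the released scripts), and the entropy term is a cheap proxy computed from sampled-token log-probabilities rather than the full next-token distribution.

```python
import torch

def rollout_metrics(token_logprobs, response_mask, rewards):
    """Hypothetical monitoring helper for the trends listed above.

    token_logprobs: (batch, seq_len) log-probs of the sampled tokens.
    response_mask: (batch, seq_len) 1 for generated tokens, 0 for prompt/padding.
    rewards: (batch,) scalar reward per sampled response.
    """
    mask = response_mask.float()
    n_tokens = mask.sum().clamp(min=1.0)
    return {
        "response_length/mean": mask.sum(dim=-1).mean().item(),                      # length growth
        "reward/mean": rewards.float().mean().item(),                                 # reward stability
        "prob/mean": ((token_logprobs.exp() * mask).sum() / n_tokens).item(),         # mean token probability
        "entropy/proxy": (-(token_logprobs * mask).sum() / n_tokens).item(),          # exploration proxy
    }
```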
To benefit the broader research community, we fully open-source our RL training recipe, including algorithm details, dataset, and infrastructure.
We provide training and validation datasets for DAPO training.
- Training: DAPO-Math-17k, a carefully curated and processed math dataset.
- Validation: AIME 2024.
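For orientation, here is a minimal loading sketch, assuming the prepared data is available locally as parquet files; the file paths below are placeholders, not the actual file names produced by the preparation script.

```python
# Minimal inspection sketch; paths are placeholders for wherever the
# dataset-preparation script writes the parquet files locally.
import pandas as pd

train_df = pd.read_parquet("data/dapo-math-17k.parquet")  # hypothetical path
val_df = pd.read_parquet("data/aime-2024.parquet")        # hypothetical path

print(f"{len(train_df)} training prompts, {len(val_df)} validation prompts")
print(train_df.columns.tolist())  # inspect the available fields (prompt, answer, ...)
```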
We provide out-of-the-box scripts for reproducing DAPO training. The quickstart and core code are described in the README. Scripts are provided for:
- Datasets Preparation
- DAPO w/o Token-level PG Loss & Dynamic Sampling -- AIME 44
- DAPO Full -- AIME 50
Note:
- The DAPO w/o Token-level PG Loss & Dynamic Sampling -- AIME 44 script has been verified on the current verl and achieves 44 points on AIME; its training record can be accessed in wandb.
- The final performance of DAPO (50 on AIME) is achieved using the full DAPO algorithm on our internal codebase, which includes heavy engineering optimizations built on top of verl. The DAPO Full script provides the command to run the full DAPO algorithm, but we have not yet verified it on verl.
We thank the verl team for providing the awesome open-source RL infrastructure.
Our open-sourced experiments were conducted on the Volcano Engine Machine Learning Platform. We will later provide a full reproduction guide on the Volcano Engine platform to help users replicate our experiments.