
DAPO: an Open-source RL System from
ByteDance Seed and Tsinghua AIR

Paper Blog Dataset Weights

Important

🔥 News!!!

  • [2025/03] We release, in wandb, the training record of an early version of DAPO (w/o Token-level PG Loss & Dynamic Sampling) that achieves 44% on AIME 2024.

We release a fully open-sourced system for large-scale LLM RL, including algorithm, code infrastructure, and dataset. The system achieves state-of-the-art large-scale LLM RL performance. We propose the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm. Through open-sourcing, we provide the broader research community and society with practical access to scalable reinforcement learning, enabling all to benefit from these advancements. Our system is based on the awesome verl framework. Thanks for their great work!
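As a rough illustration of the decoupled-clip and token-level loss ideas named above, here is a minimal PyTorch sketch of a clipped policy-gradient loss with separate lower and upper clip ranges. All names (`dapo_style_pg_loss`, `eps_low`, `eps_high`) are illustrative, not the repository's API; refer to the paper and code for the exact objective and hyperparameters.

```python
import torch

def dapo_style_pg_loss(log_probs, old_log_probs, advantages, response_mask,
                       eps_low=0.2, eps_high=0.28):
    """Sketch of a token-level clipped policy-gradient loss with decoupled clip ranges.

    Simplifications (this is NOT the repository's implementation):
      - importance ratios are computed per token,
      - the clip interval [1 - eps_low, 1 + eps_high] is asymmetric ("Clip-Higher"),
      - the loss is averaged over all response tokens in the batch rather than
        per sequence (token-level PG loss).
    Shapes: all tensors are (batch, seq_len); response_mask is 1 on response tokens.
    """
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    per_token_loss = -torch.min(unclipped, clipped)
    # Token-level aggregation: normalize by the total number of response tokens.
    return (per_token_loss * response_mask).sum() / response_mask.sum().clamp(min=1)
```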

Discussions Welcomed

🤗 If you have any questions about our paper, feel free to open an issue and we can discuss it there. Thank you!

Key Results

AIME 2024 Performance

🚀 DAPO achieves 50 points on AIME 2024 based on the Qwen2.5-32B base model, outperforming the previous SoTA DeepSeek-R1-Zero-Qwen-32B while using only 50% of the training steps.

[Figure: AIME 2024 accuracy of DAPO over training steps]

Metric Supervision during Training

  1. Length stability and growth: The steady increase in response length allows for greater exploration, facilitating the model’s ability to learn more complex reasoning behaviors, ultimately contributing to training stability and performance improvement.

  2. Reward score stability: A stable increase in the reward signal indicates that the model is successfully fitting the training distribution, ensuring that the learning process remains robust and consistent without significant fluctuations.

  3. Entropy and mean probability trend: A controlled increase in entropy, after an initial decrease, ensures a healthy balance between exploration and exploitation, avoiding issues such as overfitting or excessive randomness, and promoting sustained model performance.

[Figure: training dynamics of response length, reward score, entropy, and mean probability]
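As a rough sketch of how such quantities could be tracked during training, the snippet below computes batch-level monitoring metrics from sampled rollouts. Names such as `token_log_probs` and `response_mask` are illustrative assumptions, not the repository's logging API.

```python
import torch

def rollout_metrics(token_log_probs, response_mask, rewards):
    """Compute simple training-dynamics metrics for a batch of rollouts.

    token_log_probs: (batch, seq_len) log-probability of each sampled token.
    response_mask:   (batch, seq_len) 1 on generated response tokens, else 0.
    rewards:         (batch,) scalar reward per response.
    """
    mean_length = response_mask.sum(dim=-1).float().mean()  # response length
    mean_reward = rewards.float().mean()                     # reward score
    # Mean log-prob of sampled tokens; its negative is a simple entropy proxy
    # (true entropy would require the full next-token distribution).
    mean_log_prob = (token_log_probs * response_mask).sum() / response_mask.sum().clamp(min=1)
    return {
        "response_length/mean": mean_length.item(),
        "reward/mean": mean_reward.item(),
        "actor/mean_prob": mean_log_prob.exp().item(),
        "actor/entropy_proxy": (-mean_log_prob).item(),
    }
```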

Reproducibility

To benefit the broader research community, we fully open-source the recipe of our RL training, including algorithm details, dataset, and infrastructures.

Datasets

We provide training and validation datasets for DAPO training.

  • Training: DAPO-Math-17k, a carefully curated and processed math dataset.

  • Validation: AIME 2024.
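If you load the data with the Hugging Face `datasets` library, it could look roughly like the snippet below. The dataset id is an assumption based on the organization name; follow the Dataset link above for the authoritative path and format.

```python
from datasets import load_dataset

# Assumed dataset id -- verify against the Dataset link in this README.
train_ds = load_dataset("BytedTsinghua-SIA/DAPO-Math-17k", split="train")

print(train_ds)      # column names and number of examples
print(train_ds[0])   # one math problem with its ground-truth answer
```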

Training

We provide out-of-the-box scripts to reproduce DAPO training. Quickstart instructions and core code are described in the README. Scripts are provided for:

  • DAPO w/o Token-level PG Loss & Dynamic Sampling (AIME 44)

  • DAPO Full

Note:

  • The DAPO w/o Token-level PG Loss & Dynamic Sampling -- AIME 44 script has been verified on the current verl and achieves 44 points on AIME 2024; its training record can be accessed in wandb.

  • The final performance of DAPO (50 on AIME) is achieved with the full DAPO algorithm on our internal codebase, which includes heavy engineering optimizations built on top of verl. The DAPO Full script provides the command to run the full DAPO algorithm, but we have not yet verified it on open-source verl.
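For background on the Dynamic Sampling component mentioned above: the idea is to over-sample and keep only prompts whose group of rollouts is neither all correct nor all wrong, since uniform groups contribute no advantage signal under group-normalized rewards. A minimal sketch with illustrative names (not the repository's code):

```python
def dynamic_sampling_filter(groups):
    """Keep only prompt groups with mixed correctness.

    groups: list of per-prompt lists of 0/1 correctness scores,
            one score per sampled response in that prompt's group.
    Groups where every response is correct or every response is wrong are
    dropped, so each kept group yields a non-zero gradient signal.
    """
    return [scores for scores in groups if 0 < sum(scores) < len(scores)]

# Example: only the second group (mixed correctness) is kept.
print(dynamic_sampling_filter([[1, 1, 1], [1, 0, 1], [0, 0, 0]]))  # -> [[1, 0, 1]]
```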

Acknowledgement

We thank the verl team for providing the awesome open-source RL infrastructure.

Our open-sourced experiments were conducted on the Volcano Engine Machine Learning Platform. We will provide a full reproduction guideline later on the Volcano Engine platform to help users replicate our experiments.
