X-R1 aims to build an easy-to-use, low-cost training framework based on end-to-end reinforcement learning to accelerate the development of scaling post-training.
Inspired by DeepSeek-R1 and open-r1, we reproduce the R1-Zero "Aha Moment" 💡 from a 0.5B base model at minimal cost.
- 🔥 Training with LoRA
- 4x3090/4090 GPUs, ~1 hour of training, 💰 cost < $7; the "Aha Moment" 💡 appears around step 37 (~10 minutes in)
- RL training of a 0.5B-scale model
- Support for BIGGER models: 1.5B/7B/32B...
- 0.75k/1.5k/7.5k datasets provided for fast training loops
- GRPO online sampling data is logged to a file
- 2025.02.16: Support LoRA
- 2025.02.15: Release Chinese training
- 2025.02.13: Release X-R1-3B, which better follows the format; Colab inference
- 2025.02.12: Release X-R1-1.5B config/wandb/model/log
- 2025.02.12: Release X-R1 first version
Running Scripts:
```bash
bash ./scripts/run_x_r1_zero.sh
```
We share training details (config/wandb/model/log) as well as evaluation results:
📈 wandb details | 🔥 Colab Inference | 🤗 Models
We have confirmed the effectiveness of the X-R1 RL-Zero training method on 0.5B/1.5B/3B base models. Even without SFT, reinforcement learning incentivizes the model's reasoning abilities and format-following capabilities, and the experimental results of X-R1 are very encouraging.
Training config:

| Model | 0.5B | 1.5B | 3B | 7B |
|---|---|---|---|---|
| TargetModel | X-R1-0.5B | X-R1-1.5B | X-R1-3B | |
| Log | [link] | [link] | [link] | |
| GPU | 4x3090 | 4x3090 | 4x3090 | |
| Base | Qwen/Qwen2.5-0.5B | Qwen/Qwen2.5-1.5B | Qwen/Qwen2.5-3B | |
| Dataset | X-R1-750 | X-R1-750 | X-R1-750 | |
| Config: recipes | X_R1_zero_0dot5B_config.yaml | X_R1_zero_1dot5B_config.yaml | X_R1_zero_3B_config.yaml | |
| num_generations | 16 | 8 | 4 | |
| max_completion_length | 1024 | 1024 | 1024 | |
| num_train_epochs | 3 | 3 | 3 | |
| Training time (h:mm:ss) | 1:14:10 | 1:59:06 | 2:23:06 | |
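If you want to verify that a recipe matches the table above, the GRPO fields can be inspected directly. This is only a quick sketch and assumes the fields appear under these exact names in the YAML recipe:

```bash
# Print the key GRPO settings from the 0.5B recipe (field names assumed from the table)
grep -E "num_generations|max_completion_length|num_train_epochs" \
    recipes/X_R1_zero_0dot5B_config.yaml
```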
0.5B on 4x3090: with 4 GPUs, set `--num_processes=3`; the remaining GPU runs vLLM as an online inference engine for faster GRPO sampling.
Example: 4x4090, 3 epochs, training time ~1h20min.
```bash
ACCELERATE_LOG_LEVEL=info \
accelerate launch \
--config_file recipes/zero3.yaml \
--num_processes=3 \
src/x_r1/grpo.py \
--config recipes/X_R1_zero_0dot5B_config_peft.yaml \
> ./output/x_r1_0dot5B_sampling.log 2>&1
```
Tip: use `--config recipes/X_R1_zero_3B_config.yaml` for better learning of reasoning and format.

An example "Aha Moment" from a sampled completion:

> Wait, that doesn't match either of our options. It seems like I made a mistake in my assumptions. Let's go back to the original equations.
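To spot such moments in your own runs, you can scan the GRPO sampling log for reflection-style phrases. This is only an illustrative check, not part of X-R1; the keywords and log path are assumptions you may need to adjust:

```bash
# Look for reflection-style phrases in the sampled completions (keywords are illustrative)
grep -nE "Wait,|made a mistake|go back" ./output/x_r1_0dot5B_sampling.log | head -n 20
```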
X-R1 supports Chinese math reasoning; it is easy to reproduce a Chinese "Aha Moment", as follows:
```bash
ACCELERATE_LOG_LEVEL=info \
accelerate launch \
--config_file recipes/zero3.yaml \
--num_processes=3 \
src/x_r1/grpo.py \
--config recipes/examples/mathcn_zero_3B_config.yaml \
> ./output/mathcn_3B_sampling.log 2>&1
```
X-R1 trains the 3B base model on 7.5k Chinese math problems in ~16 hours on 4x3090.
We track the "Aha Moment" in the X-R1-3B-CN training log.
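A similar keyword scan works for the Chinese sampling log; the keywords below (roughly "wait", "recheck", "mistake") are only assumptions to illustrate the idea:

```bash
# Scan the Chinese sampling log for reflection-style phrases (keywords are illustrative)
grep -nE "等等|重新检查|错误" ./output/mathcn_3B_sampling.log | head -n 20
```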
Requirement: CUDA >= 12.4

```bash
conda create -n xr1 python=3.11
conda activate xr1
```

Then install the dependencies:

```bash
pip install -r requirements.txt
pip install flash-attn
```
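After installation, a quick import check can confirm that PyTorch sees CUDA and that flash-attn built correctly (an optional sanity check, not part of the official setup):

```bash
# Optional: verify the CUDA and flash-attn installation
python -c "import torch, flash_attn; print(torch.__version__, torch.cuda.is_available())"
```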
To test the environment, first create the output directory:

```bash
mkdir output
```
[Option] Single GPU with LoRA:

```bash
ACCELERATE_LOG_LEVEL=info \
accelerate launch \
--config_file recipes/zero1.yaml \
--num_processes=1 \
src/x_r1/grpo.py \
--config recipes/X_R1_zero_0dot5B_peft_config.yaml \
> ./output/x_r1_test_sampling.log 2>&1
```
[Option] Multi-GPU:

```bash
ACCELERATE_LOG_LEVEL=info \
accelerate launch \
--config_file recipes/accelerate_configs/zero3.yaml \
--num_processes=1 \
src/x_r1/grpo.py \
--config recipes/x_r1_test_sampling.yaml \
> ./output/test.log 2>&1
```
Then check the log file: `./output/test.log`
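To watch the test run and confirm it is healthy, you can follow the log live and scan it for errors (illustrative commands only):

```bash
# Follow the sampling output while the test run is going
tail -f ./output/test.log

# Check whether anything failed
grep -iE "error|traceback" ./output/test.log
```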
- Support QLoRA GRPO training
- Release 7B config/results
- Add more rule-based rewards
- Support more base models
- Add benchmark evaluation results
If you have any suggestions, please contact: [email protected]