FastCuRL: Curriculum Reinforcement Learning with Stage-wise Context Scaling for Efficient Training R1-like Reasoning Models

nick7nlp/FastCuRL

2025.05.23

We release FastCuRL-1.5B-V3 and FastCuRL-1.5B-V2.

2025.03.17

We release FastCuRL-1.5B-Preview, a slow-thinking reasoning model that outperforms 📈 the previous SoTA DeepScaleR-1.5B-Preview with only 🚀 about 50% of the training steps! We propose a curriculum RL framework with stage-wise context scaling that achieves efficient training and concise CoT reasoning on top of DeepSeek-R1-Distill-Qwen-1.5B, and we observe continuous performance improvement as training steps increase. To make our work easier to reproduce and to advance research progress, we open-source our code, model, and data.

Key Results

| Model | Training Steps | Training Stages | Number of GPUs Used in Each Stage |
|---|---|---|---|
| DeepScaleR-1.5B-Preview | ~1,750 | 3 | 8, 16, 32 |
| FastCuRL-1.5B-Preview | ~860 | 4 | 8, 8, 8, 8 |
| FastCuRL-1.5B-V2 | ~1,710 | 5 | 8, 8, 8, 8, 8 |
| FastCuRL-1.5B-V3 | ~2,620 | 5 | 8, 8, 8, 8, 8 |

Here, we uniformly set the batch size to 128 for counting training steps, meaning two steps with a batch size of 64 are counted as one with a batch size of 128.
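The step-counting convention above can be sketched as a small helper. This is only an illustration: the function name `normalized_steps` and its parameters are hypothetical and are not part of the released code.

```python
def normalized_steps(steps: int, batch_size: int, reference_batch_size: int = 128) -> float:
    """Convert raw optimizer steps into steps at the reference batch size.

    Hypothetical helper: it only mirrors the counting convention stated above,
    where steps are rescaled by batch_size / reference_batch_size.
    """
    return steps * batch_size / reference_batch_size


# Two steps at batch size 64 count as one step at batch size 128.
print(normalized_steps(2, 64))
```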

We report Pass@1 accuracy averaged over 16 samples for each problem.

| Model | AIME 2024 | MATH 500 | AMC 2023 | Minerva Math | OlympiadBench | Avg. |
|---|---|---|---|---|---|---|
| Qwen2.5-Math-7B-Instruct | 13.3 | 79.8 | 50.6 | 34.6 | 40.7 | 43.8 |
| rStar-Math-7B | 26.7 | 78.4 | 47.5 | - | 47.1 | - |
| Eurus-2-7B-PRIME | 26.7 | 79.2 | 57.8 | 38.6 | 42.1 | 48.9 |
| Qwen2.5-7B-SimpleRL | 26.7 | 82.4 | 62.5 | 39.7 | 43.3 | 50.9 |
| DeepSeek-R1-Distill-Qwen-1.5B | 28.8 | 82.8 | 62.9 | 26.5 | 43.3 | 48.9 |
| Still-1.5B | 32.5 | 84.4 | 66.7 | 29.0 | 45.4 | 51.6 |
| DeepScaleR-1.5B-Preview | 43.1 | 87.8 | 73.6 | 30.2 | 50.0 | 57.0 |
| FastCuRL-1.5B-Preview | 43.1 | 88.0 | 74.2 | 31.6 | 50.4 | 57.5 |
| FastCuRL-1.5B-V2 | 47.5 | 89.3 | 77.0 | 32.8 | 53.3 | 60.0 |
| FastCuRL-1.5B-V3 | 49.6 | 90.5 | 78.5 | 34.7 | 54.5 | 61.6 |
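For readers unfamiliar with the metric, Pass@1 averaged over several samples per problem can be sketched as below. This is a minimal illustration, not the evaluation code from this repository: the function `pass_at_1` and its input layout (one list of correctness flags per problem) are assumptions.

```python
def pass_at_1(correct_flags: list[list[bool]]) -> float:
    """Average Pass@1 in percent.

    correct_flags[i][j] is True when sample j for problem i was judged correct.
    Each problem contributes the fraction of its samples that are correct,
    and the benchmark score is the mean of those fractions.
    """
    per_problem = [sum(flags) / len(flags) for flags in correct_flags]
    return 100.0 * sum(per_problem) / len(per_problem)


# Two problems with 16 samples each: one always solved, one solved half the time.
flags = [[True] * 16, [True] * 8 + [False] * 8]
print(pass_at_1(flags))
```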

Getting Started 🎯

Installation

# Installing Python 3.10 Environment.
conda create -n rllm python=3.10 -y
conda activate rllm

# Installing RLLM dependencies.
cd rllm
pip install -e ./verl
pip install -e .

Training Data

Following DeepScaleR, our training dataset consists of 40,315 unique problem-answer pairs compiled from:

  • AIME problems (1984-2023)
  • AMC problems (before 2023)
  • Omni-MATH dataset
  • Still dataset

Entropy

Training Scripts

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export VLLM_ATTENTION_BACKEND=XFORMERS

# Run 8K context length training, 160 steps
bash ./scripts/train/run_fastcurl_1.5b_8k_stage1.sh | tee -a fastcurl-1.5b-stage1.log

# Run 16K context length training, 590 steps
bash ./scripts/train/run_fastcurl_1.5b_16k_stage2.sh | tee -a fastcurl-1.5b-stage2.log

# Run 24K context length training, 230 steps
bash ./scripts/train/run_fastcurl_1.5b_24k_stage3.sh | tee -a fastcurl-1.5b-stage3.log

# Run 16K context length training, 580 steps
bash ./scripts/train/run_fastcurl_1.5b_16k_stage4.sh | tee -a fastcurl-1.5b-stage4.log
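If you prefer to launch the four stages from a single driver, one possible sketch is shown below. The script paths and log names are copied from the commands above; the helper names `STAGE_SCRIPTS`, `log_name`, and `run_stages` are hypothetical. Unlike `tee -a`, this sketch writes output (including stderr) only to the log file, and it assumes the environment variables above are already exported.

```python
import subprocess

# Script paths copied from the commands above, in curriculum order.
STAGE_SCRIPTS = [
    "./scripts/train/run_fastcurl_1.5b_8k_stage1.sh",
    "./scripts/train/run_fastcurl_1.5b_16k_stage2.sh",
    "./scripts/train/run_fastcurl_1.5b_24k_stage3.sh",
    "./scripts/train/run_fastcurl_1.5b_16k_stage4.sh",
]


def log_name(stage_index: int) -> str:
    """Log file name used for a stage, matching the names in the commands above."""
    return f"fastcurl-1.5b-stage{stage_index}.log"


def run_stages(scripts=STAGE_SCRIPTS, dry_run=True):
    """Run the stages in order and stop at the first failure.

    With dry_run=True (the default) nothing is executed; the planned
    (command, log file) pairs are only returned for inspection.
    """
    planned = []
    for index, script in enumerate(scripts, start=1):
        command = ["bash", script]
        planned.append((command, log_name(index)))
        if dry_run:
            continue
        # Append to the log, as `tee -a` does in the shell commands.
        with open(log_name(index), "a") as log:
            result = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
        if result.returncode != 0:
            raise RuntimeError(f"stage {index} failed with exit code {result.returncode}")
    return planned
```

Calling `run_stages(dry_run=False)` from the repository root would execute the stages one after another; the default dry run only lists what would be launched.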

Evaluate

python3 -m verl.trainer.main_generation \
    trainer.nnodes=1 \
    trainer.n_gpus_per_node=8 \
    data.path=./fastcurl/data/test/xxx.parquet \
    data.output_path=${OUTPUT_DIR}/xxx.parquet \
    data.n_samples=16 \
    data.batch_size=2048 \
    model.path=${MODEL_PATH} \
    rollout.temperature=0.6 \
    rollout.response_length=32768 \
    rollout.top_k=-1 \
    rollout.top_p=1 \
    rollout.gpu_memory_utilization=0.9 \
    rollout.tensor_model_parallel_size=1

Citation

@misc{fastcurl,
      title={FastCuRL: Curriculum Reinforcement Learning with Stage-wise Context Scaling for Efficient Training R1-like Reasoning Models}, 
      author={Mingyang Song and Mao Zheng and Zheng Li and Wenjie Yang and Xuan Luo and Yue Pan and Feng Zhang},
      year={2025},
      eprint={2503.17287},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.17287}, 
}

Acknowledgements

  • Our model is trained on top of DeepSeek-R1-Distill-Qwen-1.5B.
  • Our training experiments are powered by our heavily modified fork of verl.
  • We build directly on DeepScaleR's code for our experiments. To avoid confusion, we have renamed the parts of the code that caused naming conflicts.
