Skip to content
forked from ganler/code-r1

Reproducing R1 for Code with Reliable Rewards

License

Notifications You must be signed in to change notification settings

KodCode-AI/code-r1

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Code-R1: Reproducing R1 for Code with Reliable Rewards

This repository includes implementations to reproduce the R1 pipeline for coding with synthetic datasets.

Some contents in this readme are modified for KodCode.

Setup

Environment

# For training
pip install -e .
pip install vllm==0.8.1
pip install flash-attn --no-build-isolation
pip install wandb IPython matplotlib gpustat # utility
pip install -U "huggingface_hub[cli]"

sudo apt-get install python3-pytest -y # pytest env for kodcode

Sandboxing

Please install firejail for reliable sandboxing.

sudo add-apt-repository ppa:deki/firejail
sudo apt-get update
sudo apt-get install firejail firejail-profiles

# Alternatively, build from source
cd firejail
sudo apt-get install gawk -y
chmod +x ./configure
chmod +x src/man/mkman.sh
./configure && make && sudo make install-strip
cd ..

Datasets (Code-R1-12K)

The current version has 12K RL samples (prompt + tests) at 🤗 ganler/code-r1-12k:

  • 2K LeetCode data where the tests are generally reliable
  • 10K verified data filtered from 26K TACO data.

In general, it's suggesgted to test data & sandbox on every dataset & environment before training code RL. Directly using noisy data and mismatched envornments can lead to reward false positives, confusing the model. These noise could come from (i) wrong tests, (ii) unsolvable prompts (e.g., images tags), and (iii) execution environment mismatch.

python examples/data_preprocess/coder1.py

Datasets (KodCode-Light)

The current version has 10K RL samples (prompt + tests) at 🤗 KodCode/KodCode-Light-RL-10K:

To produce locally validated RL data:

python examples/data_preprocess/kodcode.py

Run KodCode!

We have tested several GRPO configurations across various GPU types (including A100, A6000, and RTX 4090). For example, to fine-tune a 7B model with GRPO using 8×A100 GPUs, simply run:

bash kodcode_grpo_7b_8a100.sh

Code-R1 based on 7B models + KodCode-Light

Experimental Setup. We conduct RL experiments on both Qwen2.5-7B-Instruct-1M and Qwen2.5-Coder-7B-Instruct using 10K randomly selected samples from KODCODE-V1, which we named as 🤗 KodCode/KodCode-Light-RL-10K. We perform GRPO with actor learning rate of 5*10^-7, 16 rollouts per question, a batch size of 256, max response length of 4096, and apply KL coefficient of 0.001.

Experimental Results. The experimental results are demonstrated in the table below. We observe significant performance improvement on all benchmarks compared to baselines after RL. In addition, we observe that continuing to increase the training steps can further enhance the performance.

Model LiveCodeBench (Easy|Medium|Hard) BCB-Complete (Full|Hard) BCB-Instruct (Full|Hard) HumanEval (Base|+) MBPP (Base|+) Average
Qwen2.5-Coder-7B-Instruct 0.574 | 0.230 | 0.044 0.520 | 0.216 0.418 | 0.196 0.915 | 0.854 0.831 | 0.717 0.5232
+ RL KodCode-10K (Step 128) 0.652 | 0.211 | 0.041 0.525 | 0.257 0.422 | 0.203 0.909 | 0.860 0.849 | 0.728 0.5356
+ RL KodCode-10K (Step 256) 0.645 | 0.199 | 0.033 0.537 | 0.270 0.429 | 0.216 0.902 | 0.854 0.865 | 0.741 0.5399
Qwen2.5-7B-Instruct-1M (Q7I1M) 0.577 | 0.124 | 0.037 0.453 | 0.142 0.366 | 0.176 0.860 | 0.793 0.788 | 0.693 0.4763
+ RL KodCode-10K (Step 128) 0.602 | 0.190 | 0.026 0.470 | 0.196 0.367 | 0.135 0.902 | 0.835 0.810 | 0.709 0.4969
+ RL KodCode-10K (Step 256) 0.570 | 0.187 | 0.030 0.482 | 0.196 0.368 | 0.128 0.915 | 0.860 0.828 | 0.728 0.503

Citation

If you find this work helpful...

@article{code-r1,
  title={Code-R1: Reproducing R1 for Code with Reliable Rewards},
  author={Liu, Jiawei and Zhang, Lingming},
  howpublished={\url{https://github.com/ganler/code-r1}},
  year={2025}
}

@article{xu2025kodcode,
      title={KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding}, 
      author={Zhangchen Xu and Yang Liu and Yueqin Yin and Mingyuan Zhou and Radha Poovendran},
      year={2025},
      eprint={2503.02951},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2503.02951}, 
}

Acknowledgements

License

Apache-2.0. See LICENSE.code-r1 for more details.

About

Reproducing R1 for Code with Reliable Rewards

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 44.0%
  • C 40.5%
  • Shell 7.1%
  • Roff 5.2%
  • Makefile 2.3%
  • Vim Script 0.4%
  • Other 0.5%