LeetCodeDataset

LeetCodeDataset is a dataset comprising Python LeetCode problems designed for training and evaluating Large Language Models (LLMs).

đź’» Hugging Face Datasets

Data Fields

The dataset adheres to the human-eval problem file format.

  • task_id: The LeetCode problem's question title slug, which corresponds to the problem URL.
  • prompt: The prefix for the completion, such as basic imports.
  • entry_point: The function name used for evaluation.
  • test: A function to check test cases.
  • completion: The completion without the prompt.
  • query: The query including problem description and starter code.
  • response: The correct response.
  • input_output: Test cases.
  • meta:
    • question_id: The LeetCode problem's question ID.
    • difficulty: The problem's difficulty level (Easy, Medium, or Hard).
    • lang_code: The format of the completion.
    • question_title: The problem description.
    • tags: The problem's topic tags, e.g. ['Array', 'Hash Table'].
    • estimated_date: The problem's estimated release date.
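As a sketch of how these fields fit together, the snippet below parses one record with the schema above. The field names come from the list; the values are illustrative, not real dataset entries:

```python
import json

# An illustrative record following the documented fields (values are made
# up for demonstration; real records come from the Hugging Face dataset
# or the JSONL files under ./data).
sample = json.loads("""
{
  "task_id": "two-sum",
  "prompt": "from typing import List\\n",
  "entry_point": "twoSum",
  "test": "def check(candidate): assert candidate([2, 7, 11, 15], 9) == [0, 1]",
  "completion": "class Solution:\\n    def twoSum(self, nums, target): ...",
  "query": "Given an array of integers nums ...",
  "response": "class Solution:\\n    def twoSum(self, nums, target): ...",
  "meta": {"question_id": 1, "difficulty": "Easy", "tags": ["Array", "Hash Table"]}
}
""")

# prompt + completion together form the full program that gets evaluated
# against the test function.
full_program = sample["prompt"] + sample["completion"]
print(sample["task_id"], sample["meta"]["difficulty"])
```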

Training

LeetCodeDataset can be used for training as follows:

  1. The dataset is split into training and test sets. Problems are ordered by question_id, with those having larger question_id values used for the test set.
  2. Fine-tune the LLM on the training split, using query as the input and response as the target.
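The split rule above (largest question_id values go to the test set) can be sketched as follows; the problem entries and the test-set size here are hypothetical, while the real per-version sizes are listed in the table below:

```python
# Sketch of the train/test split: order problems by question_id and
# reserve those with the largest ids for the test set.
problems = [
    {"question_id": 42, "task_id": "trapping-rain-water"},
    {"question_id": 1, "task_id": "two-sum"},
    {"question_id": 3000, "task_id": "hypothetical-recent-problem"},
    {"question_id": 200, "task_id": "number-of-islands"},
]

test_size = 1  # illustrative; e.g. v0.3.0 uses 386
ordered = sorted(problems, key=lambda p: p["question_id"])
train, test = ordered[:-test_size], ordered[-test_size:]
```

Because newer problems have larger ids, this keeps the test split closer to the dataset's most recent problems.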

The number of problems in each version and split is as follows:

Version   Train   Test
v0.1.0    1570    175
v0.2.0    1890    200
v0.3.0    2386    386

Evaluation

Installation

git clone https://github.com/newfacade/LeetCodeDataset
cd LeetCodeDataset
pip install -e .

LeetCodeDataset Evaluation Example

eval_lcd --version v0.3.0 \
         --split test \
         --input_file ./data/LeetCodeDataset-v0.3.0-test.jsonl \
         --predict_column completion

Explanation of Parameters

  • version: One of v0.1.0, v0.2.0, or v0.3.0.
  • split: test or train.
  • input_file: A JSONL file containing the predictions for the specified LeetCodeDataset version and split; each line must contain task_id and the prediction.
  • predict_column: The column name of the prediction in input_file. For example, lines like {'task_id': 'two_sum', 'output': 'To solve the problem of finding two indices ...'} require --predict_column output.
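A minimal sketch of producing such an input_file, assuming model outputs are already collected in memory (the task ids, outputs, and file name here are hypothetical):

```python
import json

# Hypothetical model outputs keyed by task_id. The only required columns
# are task_id plus the column you will pass via --predict_column.
predictions = [
    {"task_id": "two-sum", "output": "To solve the problem ... class Solution: ..."},
    {"task_id": "add-two-numbers", "output": "We iterate over both lists ..."},
]

# Write one JSON object per line (JSONL).
with open("predictions.jsonl", "w") as f:
    for row in predictions:
        f.write(json.dumps(row) + "\n")

# Then evaluate with:
#   eval_lcd --version v0.3.0 --split test \
#            --input_file predictions.jsonl --predict_column output
```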

You can also perform custom evaluations using the evaluate_functional_correctness command, which is consistent with human-eval.

Data Curation

  1. Metadata Acquisition, including:
    • question id: unique numeric identifier
    • question: URL-related string (serves as the primary task id)
    • problem description
    • starter code
  2. Canonical Solution Verification
    • Retrieved reference solutions from GitHub open-source datasets
    • Validated solution correctness through LeetCode’s official execution environment
  3. Entry Point Identification: Implemented text pattern matching to detect target functions
  4. Test Case Generation
  5. Automated Evaluation Framework
    • Developed sandboxed execution environment for safe code evaluation
    • Implemented a trial-and-error mechanism to execute canonical solutions against generated inputs
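One common way to realize the sandboxed execution step is to run each candidate program in a separate process with a time limit. This is a simplified sketch under that assumption, not the repository's actual sandbox, which would add stricter isolation (resource limits, restricted builtins, etc.):

```python
import subprocess
import sys
import tempfile

def run_candidate(program: str, timeout: float = 5.0) -> bool:
    """Run a candidate program in a child process with a time limit.

    Returns True only if the program exits cleanly within the timeout.
    This is an illustrative sketch of sandboxed evaluation.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], timeout=timeout, capture_output=True
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

# A passing program: a solution concatenated with its assertion-based check,
# mirroring how prompt + completion + test would be combined.
ok = run_candidate("def f(x):\n    return x + 1\n\nassert f(1) == 2\n")
```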

Papers/Blogs/Projects Using LeetCodeDataset

Citation

@software{xia2025leetcodedataset,
  author = {Yunhui Xia and Wei Shen and Jason Klein Liu and Yan Wang and Siyue Wu and Xiaonan He},
  title = {LeetCodeDataset: A Dataset of Algorithmic Problems Suitable for LLM Training and Evaluation},
  year = {2025},
  url = {https://github.com/newfacade/LeetCodeDataset},
  version = {0.1.0},
}

🙏 Acknowledgment
