Releases: bigcode-project/bigcode-evaluation-harness
Initial release of BigCode Evaluation Harness
Release notes
These are the release notes for the initial release of the BigCode Evaluation Harness.
Goals
The framework aims to achieve the following goals:
- Reproducibility: Making it easy to report and reproduce results.
- Ease-of-use: Providing access to a diverse range of code benchmarks through a unified interface.
- Efficiency: Leveraging data parallelism on multiple GPUs to generate benchmark solutions quickly.
- Isolation: Using Docker containers for executing the generated solutions.
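As a rough illustration of the data-parallel idea behind the Efficiency goal, the sketch below splits multi-sample generation work across processes in round-robin fashion. It is not the harness's code: the harness relies on accelerate for this, and the helper names `build_work_units` and `shard_for_process` are hypothetical, chosen only for this example.

```python
from typing import List, Tuple

WorkUnit = Tuple[int, int]  # (problem index, sample index)


def build_work_units(num_problems: int, n_samples: int) -> List[WorkUnit]:
    """Create one work unit per (problem, sample) pair."""
    return [(p, s) for p in range(num_problems) for s in range(n_samples)]


def shard_for_process(
    units: List[WorkUnit], process_index: int, num_processes: int
) -> List[WorkUnit]:
    """Round-robin shard: process i takes units i, i + N, i + 2N, ..."""
    return units[process_index::num_processes]


# Hypothetical setup: 3 problems, 4 samples each, 2 processes (one per GPU).
units = build_work_units(num_problems=3, n_samples=4)
shards = [shard_for_process(units, rank, num_processes=2) for rank in range(2)]
```

Each process generates only its own shard, and the results are gathered afterwards. Because the shards are disjoint and together cover every work unit, no sample is generated twice or skipped.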
Release overview
The framework supports the following features & tasks:
- Features:
  - Any autoregressive model available on the Hugging Face Hub can be used, but we recommend code generation models trained specifically on code.
  - We provide multi-GPU text generation with accelerate for multi-sample problems, and Dockerfiles for running the evaluation in Docker containers for security and reproducibility.
- Tasks:
  - 4 code generation Python tasks (with unit tests): HumanEval, APPS, MBPP and DS-1000 for both completion (left-to-right) and insertion (FIM) mode.
  - MultiPL-E evaluation suite (HumanEval translated into 18 programming languages).
  - PAL (Program-aided Language Models) evaluation for grade school math problems: GSM8K and GSM-HARD. These problems are solved by generating reasoning chains of text and code.
  - Code-to-text task from CodeXGLUE (zero-shot & fine-tuning) for 6 languages: Python, Go, Ruby, Java, JavaScript and PHP. Documentation translation task from CodeXGLUE.
  - CoNaLa for Python code generation (2-shot setting and evaluation with BLEU score).
  - Concode for Java code generation (2-shot setting and evaluation with BLEU score).
  - 3 multilingual downstream classification tasks: Java Complexity prediction, Java code equivalence prediction, C code defect prediction.
More details about each task can be found in the documentation in docs/README.md.
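Unit-test benchmarks such as HumanEval are commonly scored with pass@k over multiple samples per problem. These release notes do not state which metric is used, so the sketch below is an assumption about a typical setup, not a description of the harness's implementation. It computes the standard unbiased pass@k estimate, 1 - C(n - c, k) / C(n, k), from n generated samples of which c pass the tests; the function name `pass_at_k` is illustrative.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated for the problem
    c: number of those samples that pass all unit tests
    k: number of samples the metric is allowed to draw
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every draw of k includes a passing one.
    if n - c < k:
        return 1.0
    # Probability that all k drawn samples fail, subtracted from 1.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# Hypothetical numbers: 10 samples for one problem, 3 of them pass.
score = pass_at_k(n=10, c=3, k=1)
```

The benchmark-level score is the mean of this per-problem estimate over all problems.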
Main Contributors
Full Changelog: https://github.com/bigcode-project/bigcode-evaluation-harness/commits/v0.1.0