Collecting Evaluation Tasks for SWE-Bench

John Yang • November 1, 2023

In this tutorial, we explain how to use the SWE-Bench repository to collect evaluation task instances from GitHub repositories.

SWE-bench's collection pipeline is currently designed to target PyPI packages. We hope to expand SWE-bench to more repositories and languages in the future.

🔍 Selecting a Repository

SWE-bench constructs task instances from issues and pull requests. A good source repository should have a large number of issues and pull requests; the Top PyPI packages website is a useful reference for finding repositories that fit this bill.

Once you've selected a repository, use the /collect/make_repo/make_repo.sh script to create a mirror of the repository, like so:

./collect/make_repo/make_repo.sh scikit-learn/scikit-learn

⛏️ Collecting Candidate Tasks

Once the mirror repository has been created, you can use the collect/get_tasks_pipeline.py script to collect pull requests and convert them to candidate task instances. Supply the repository name(s) and logging folders as arguments to the run_get_tasks_pipeline.sh script, then run it like so:

./collect/run_get_tasks_pipeline.sh 
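
Under the hood, run_get_tasks_pipeline.sh is a thin wrapper around collect/get_tasks_pipeline.py. As a minimal sketch (the flag names and placeholder paths below are assumptions based on the description above, so check the script in your checkout for the exact interface), the wrapper's invocation might look like:

python collect/get_tasks_pipeline.py \
    --repos 'scikit-learn/scikit-learn' \
    --path_prs '<path to folder to save PR .jsonl files to>' \
    --path_tasks '<path to folder to save candidate task instance .jsonl files to>'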

At this point, for a repository, you should have...

  • A mirror clone of the repository under the SWE-bench organization.
  • A <repo name>-prs.jsonl file containing all the repository's PRs.
  • A <repo name>-task-instances.jsonl file containing all the candidate task instances.

📙 Specify Execution Parameters

This step is the most manual part of the process. To create an appropriate execution environment for task instances from a new repository, you must complete the following steps:

  • Assign a repository-specific version (e.g. 1.2) to every task instance.
  • Specify repository+version-specific installation commands in harness/constants.py.

Part A: Versioning

Determining a version for each task instance can be accomplished in a number of ways, depending on the availability and feasibility of each approach for the repository in question.

  • Scrape from code: A version is explicitly specified in the codebase (e.g. in __init__.py or _version.py for PyPI packages).
  • Scrape from web: Repositories with websites (e.g. xarray.dev) have a "Releases" or "What's New" page (e.g. the release page for xarray). This can be scraped for information.
  • Build from code: Sometimes, version-related files (e.g. _version.py) are purposely omitted by a developer (check .gitignore to verify). In this case, you can build the repository source code locally for each task instance and extract the version number from the built codebase.

Examples and technical details for each are included in /versioning/. Please refer to them as needed.
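
For illustration, a minimal sketch of the "scrape from code" approach is shown below; the helper name, candidate file names, and regex are assumptions for illustration only, and the scripts in /versioning/ remain the authoritative reference.

import re
from pathlib import Path
from typing import Optional

def scrape_version(repo_root: str) -> Optional[str]:
    """Search files that PyPI packages commonly use to declare an explicit version."""
    pattern = re.compile(r"__version__\s*=\s*['\"]([^'\"]+)['\"]")
    for path in Path(repo_root).rglob("*.py"):
        if path.name in ("__init__.py", "_version.py", "version.py"):
            match = pattern.search(path.read_text(errors="ignore"))
            if match:
                # Keep only major.minor, e.g. "1.2.3" -> "1.2"
                return ".".join(match.group(1).split(".")[:2])
    return None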

Part B: Installation Configurations

For each repository, you must provide installation instructions for each version. In constants.py...

  1. In MAP_VERSION_TO_INSTALL, declare a <repo owner/name>: MAP_VERSION_TO_INSTALL_<repo name> key/value pair.
  2. Define MAP_VERSION_TO_INSTALL_<repo name>, where each key is a version string and each value is a dictionary of installation fields that includes the following information:
{
    "python": "3.x", # Required
    "packages": "numpy pandas tensorflow",
    "install": "pip install -e .", # Required
    "pip_packages": ["pytest"],
}

These instructions can typically be inferred from the companion website or CONTRIBUTING.md doc that many open source repositories have.
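
Putting the two steps together, a sketch of the resulting constants.py entries for a hypothetical repository might look like the following; the repository name, version numbers, and packages are placeholders, so mirror the existing entries in harness/constants.py for the exact structure.

# Sketch only: "my-org/my-repo" and its versions are placeholders.
MAP_VERSION_TO_INSTALL_MYREPO = {
    "1.2": {
        "python": "3.9",                 # Required
        "packages": "numpy pandas",
        "install": "pip install -e .",   # Required
        "pip_packages": ["pytest"],
    },
    "1.3": {
        "python": "3.10",
        "install": "pip install -e .",
    },
}

MAP_VERSION_TO_INSTALL = {
    # ... existing entries ...
    "my-org/my-repo": MAP_VERSION_TO_INSTALL_MYREPO,
}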

⚙️ Execution-based Validation

Congrats, you got through the trickiest part! It's smooth sailing from here on out.

We now need to check that each task instance installs properly and that the problem it solves is non-trivial. This is handled by the engine_validation.py code. Run ./harness/run_validation.sh and supply the following arguments (a sketch of one possible invocation follows the list):

  • instances_path: Path to the versioned candidate task instances.
  • log_dir: Path to the folder where task instance-specific execution logs are stored.
  • temp_dir: Path to the directory in which execution is performed.
  • verbose: Whether to print logging traces to standard output.
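
As a rough sketch, an invocation might look like the following; the flag spellings mirror the argument names above but are assumptions, so check run_validation.sh for the exact interface.

python harness/engine_validation.py \
    --instances_path '<path to versioned candidate task instances .jsonl>' \
    --log_dir './logs/validation' \
    --temp_dir './tmp/validation' \
    --verbose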

In practice, you may have to iterate between this step and Installation Configurations a couple of times. If your instructions are incorrect or under-specified, candidate task instances may not install properly.

🔄 Convert to Task Instances

At this point, we have all the information necessary to determine whether task instances can be used for evaluation with SWE-bench, and to save the ones that qualify.

We provide the validation.ipynb Jupyter notebook in this folder to make the remaining steps easier. At a high level, it enables the following:

  • In Monitor Validation, check the results of the ./run_validation.sh step.
  • In Get [FP]2[FP] Tests, determine which task instances are non-trivial (i.e. solve at least one test).
  • In Create Task Instances .json file, perform some final preprocessing and save your task instances to a .json file.

Thanks for reading! If you have any questions or comments about the details in the article, please feel free to follow up with an issue.