John Yang • November 1, 2023
In this tutorial, we explain how to use the SWE-Bench repository to collect evaluation task instances from GitHub repositories.
SWE-bench's collection pipeline is currently designed to target PyPI packages. We hope to expand SWE-bench to more repositories and languages in the future.
SWE-bench constructs task instances from issues and pull requests, so a good repository to source evaluation instances from should have many of both. The Top PyPI packages website is a useful point of reference for finding repositories that fit this bill.
Once you've selected a repository, use the `/collect/make_repo/make_repo.sh` script to create a mirror of the repository, like so:

```bash
./collect/make_repo/make_repo.sh scikit-learn/scikit-learn
```
Once the mirror has been created, you can then use the `collect/get_tasks_pipeline.py` script to collect pull requests and convert them to candidate task instances. Supply the repository name(s) and logging folders as arguments to the `run_get_tasks_pipeline.sh` script, then run it like so:

```bash
./collect/run_get_tasks_pipeline.sh
```
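Under the hood, the wrapper simply calls `collect/get_tasks_pipeline.py` with the values you fill in. A minimal sketch of what the edited script might contain is below; the flag names are assumptions, so check `get_tasks_pipeline.py` for the exact argument names it accepts.

```bash
# Sketch of collect/run_get_tasks_pipeline.sh after editing.
# Flag names are assumptions -- verify against get_tasks_pipeline.py.
python collect/get_tasks_pipeline.py \
    --repos "scikit-learn/scikit-learn" \
    --path_prs "<folder to save PR .jsonl files to>" \
    --path_tasks "<folder to save task instance .jsonl files to>"
```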
At this point, for a repository, you should have...
- A mirror clone of the repository under the SWE-bench organization.
- A `<repo name>-prs.jsonl` file containing all the repository's PRs.
- A `<repo name>-task-instances.jsonl` file containing all the candidate task instances.
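To give a sense of what ends up in the task instance file, here is a rough sketch of a single candidate record. The field names mirror the published SWE-bench schema, but treat the exact set as approximate and inspect your own `.jsonl` output.

```python
# Approximate shape of one line in <repo name>-task-instances.jsonl
# (field set and values are illustrative, not taken from a real record).
{
    "repo": "scikit-learn/scikit-learn",
    "instance_id": "scikit-learn__scikit-learn-12345",  # hypothetical ID
    "base_commit": "<commit the PR was applied against>",
    "patch": "<diff with the non-test code changes>",
    "test_patch": "<diff with the test changes>",
    "problem_statement": "<title and body of the linked issue(s)>",
    "hints_text": "<issue comments made before the fix>",
    "created_at": "<PR creation timestamp>",
}
```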
This step is the most manual of all parts. To create an appropriate execution environment for task instances from a new repository, you must do the following steps:
- Assign a repository-specific version (e.g. `1.2`) to every task instance.
- Specify repository+version-specific installation commands in `harness/constants.py`.
Determining a version for each task instance can be accomplished in a number of ways, depending on availability and feasibility with respect to each repository.
- Scrape from code: A version is explicitly specified in the codebase (e.g. in `__init__.py` or `_version.py` for PyPI packages). A short sketch of this approach appears below.
- Scrape from web: Repositories with websites (e.g. xarray.dev) have a "Releases" or "What's New" page (e.g. the release page for xarray). This can be scraped for information.
- Build from code: Sometimes, version-related files (e.g. `_version.py`) are purposely omitted by a developer (check `.gitignore` to verify). In this case, per task instance, you can build the repository source code locally and extract the version number from the built codebase.

Examples and technical details for each are included in `/versioning/`. Please refer to them as needed.
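As a concrete illustration of the scrape-from-code approach, the sketch below pulls a MAJOR.MINOR version string out of a package's `__init__.py`. The helper function, regex, and package layout are assumptions made for illustration; they are not part of the SWE-bench versioning scripts.

```python
import re
from pathlib import Path


def scrape_version_from_init(repo_dir: str, package: str) -> str | None:
    """Return the MAJOR.MINOR version declared in <package>/__init__.py, if any.

    Illustrative helper (not from the SWE-bench codebase). Check out the task
    instance's base_commit in repo_dir before calling it.
    """
    init_file = Path(repo_dir) / package / "__init__.py"
    if not init_file.is_file():
        return None
    match = re.search(r"__version__\s*=\s*['\"](\d+\.\d+)", init_file.read_text())
    return match.group(1) if match else None


# e.g. scrape_version_from_init("scikit-learn", "sklearn") might return "1.2"
```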
Per repository, you must provide installation instructions per version. In `constants.py`...
- In `MAP_VERSION_TO_INSTALL`, declare a `<repo owner/name>: MAP_VERSION_TO_INSTALL_<repo name>` key/value pair.
- Define a `MAP_VERSION_TO_INSTALL_<repo name>` dictionary, where each key is a version as a string, and the value is a dictionary of installation fields that includes the following information:
```python
{
    "python": "3.x", # Required
    "packages": "numpy pandas tensorflow",
    "install": "pip install -e .", # Required
    "pip_packages": ["pytest"],
}
```
These instructions can typically be inferred from the companion website or `CONTRIBUTING.md` doc that many open source repositories have.
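Putting the two pieces together, here is a hedged sketch of how these entries might look in `harness/constants.py`. The variable name suffix, version keys, Python versions, and package choices are placeholders, not the real scikit-learn configuration.

```python
# Placeholder example -- adapt the versions and packages to your repository.
MAP_VERSION_TO_INSTALL_SKLEARN = {
    "1.2": {
        "python": "3.9",                  # Required
        "packages": "numpy scipy cython",
        "install": "pip install -e .",    # Required
        "pip_packages": ["pytest"],
    },
    "1.3": {
        "python": "3.9",
        "install": "pip install -e .",
    },
}

MAP_VERSION_TO_INSTALL = {
    # <repo owner/name>: MAP_VERSION_TO_INSTALL_<repo name>
    "scikit-learn/scikit-learn": MAP_VERSION_TO_INSTALL_SKLEARN,
}
```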
Congrats, you got through the trickiest part! It's smooth sailing from here on out.
We now need to check that the task instances install properly and that the problem solved by each task instance is non-trivial. This is taken care of by the `engine_validation.py` code.

Run `./harness/run_validation.sh` and supply the following arguments:
- `instances_path`: Path to versioned candidate task instances
- `log_dir`: Path to folder to store task instance-specific execution logs
- `temp_dir`: Path to directory to perform execution
- `verbose`: Whether to print logging traces to standard output
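For example, an invocation might look like the following. The exact flag syntax is an assumption; check `run_validation.sh` to see how it forwards these arguments to `engine_validation.py`.

```bash
# Hypothetical invocation -- verify flag names against harness/run_validation.sh.
./harness/run_validation.sh \
    --instances_path "<path to versioned candidate task instances>" \
    --log_dir "./logs/validation" \
    --temp_dir "/tmp/swe-bench-validation" \
    --verbose
```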
In practice, you may have to iterate between this step and Installation Configurations a couple of times. If your instructions are incorrect or under-specified, candidate task instances may not install properly.
At this point, we have all the information necessary to determine whether task instances can be used for evaluation with SWE-bench, and to save the ones that can.
We provide the `validation.ipynb` Jupyter notebook in this folder to make the remaining steps easier.
At a high level, it enables the following:
- In Monitor Validation, check the results of the `./run_validation.sh` step.
- In Get [FP]2[FP] Tests, determine which task instances are non-trivial (i.e. solve at least one test).
- In Create Task Instances `.json` file, perform some final preprocessing and save your task instances to a `.json` file.
Thanks for reading! If you have any questions or comments about the details in the article, please feel free to follow up with an issue.