A Python package for running and evaluating the Kwaak agent against the SWE-bench dataset, a benchmark for evaluating LLMs on real-world software engineering tasks.
It provides a complete test harness that:
- Loads and processes the SWE-bench dataset
- Manages Docker containers for isolated test environments
- Executes test cases with proper environment setup
- Evaluates and grades test results
- Generates submission-ready predictions
Requires Python 3.11 or higher and Docker.
Run the benchmark using uv:
uv run kwaak-bench-swe
This will:
- Load the SWE-bench test dataset
- Take the first 2 items from each repository
- For each test case:
  - Create an isolated Docker container
  - Set up the test environment
  - Apply test patches
  - Run the Kwaak agent (with 60-minute timeout)
  - Execute test suite
  - Evaluate results
- Generate predictions in SWE-bench submission format
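
The selection step above (the first 2 items from each repository) can be sketched as follows. This is a minimal illustration, not the package's actual code: the `take_first_per_repo` helper is hypothetical, the rows are toy data, and it assumes each dataset row is a dict with `repo` and `instance_id` keys.

```python
from collections import defaultdict
from typing import Iterable


def take_first_per_repo(rows: Iterable[dict], per_repo: int = 2) -> list[dict]:
    """Keep the first `per_repo` rows seen for each repository, in input order."""
    counts: dict[str, int] = defaultdict(int)
    selected: list[dict] = []
    for row in rows:
        repo = row["repo"]  # assumed key name
        if counts[repo] < per_repo:
            counts[repo] += 1
            selected.append(row)
    return selected


# Toy rows standing in for dataset items; the ids are made up for illustration.
rows = [
    {"repo": "org/alpha", "instance_id": "alpha-1"},
    {"repo": "org/alpha", "instance_id": "alpha-2"},
    {"repo": "org/beta", "instance_id": "beta-1"},
    {"repo": "org/alpha", "instance_id": "alpha-3"},
]

for row in take_first_per_repo(rows):
    print(row["instance_id"])
```

Counting per repository in a single pass keeps the input order and does not require the rows to be sorted by repository first.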
# Run a specific test case
uv run kwaak-bench-swe --instance psf__requests-2317
# Evaluate results for a specific trial
uv run kwaak-bench-swe --evaluate psf__requests-2317 --results-path /path/to/results
src/kwaak_bench_swe/
- main.py - Entry point and benchmark orchestration
- benchmark.py - Benchmark runner and result management
- trial.py - Test execution and evaluation
- swe_bench_instance.py - SWE-bench test case representation
- docker_instance.py - Docker container management
The benchmark generates several outputs in the results directory:
- {benchmark-name}/{trial-name}.json - Detailed trial results
- {benchmark-name}/{trial-name}-pre_patch_test_results.txt - Initial test results
- {benchmark-name}/{trial-name}-test_results.txt - Post-patch test results
- {benchmark-name}/{trial-name}-patch.diff - Generated patch
- {benchmark-name}/{trial-name}-report.json - Evaluation report
- {benchmark-name}/{trial-name}/agent_result.txt - Kwaak agent output or timeout message
- predictions.jsonl - SWE-bench submission format predictions
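
As a rough illustration of what a predictions.jsonl entry may look like, the sketch below serializes one prediction per line. The key names (`instance_id`, `model_name_or_path`, `model_patch`) follow the SWE-bench prediction format; the `to_jsonl` helper, the `"kwaak"` label, and the patch text are placeholders and not taken from this package.

```python
import json


def to_jsonl(predictions: list[dict]) -> str:
    """Serialize predictions as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(prediction) + "\n" for prediction in predictions)


# Placeholder prediction; the patch body is not a real diff.
example = {
    "instance_id": "psf__requests-2317",
    "model_name_or_path": "kwaak",  # assumed label
    "model_patch": "diff --git a/example.py b/example.py\n",  # placeholder
}

print(to_jsonl([example]), end="")
```

Because `json.dumps` escapes newlines inside the patch string, each prediction stays on a single line even when the diff spans many lines.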
- Ensure all code is properly typed
- Maintain JSON serialization support for result objects
- Follow the existing pattern of using dataclasses for data structures
- Test Docker container isolation when making changes to test execution