<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>SWE-bench</title>
<meta
name="description"
content="SWE-bench: Evaluate Language Models on Open Source Software Tasks"
/>
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />
<meta
name="viewport"
content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no"
/>
<meta property="og:image" content="/logo.png" />
<link rel="shortcut icon" href="favicon.ico" type="image/x-icon" />
<link rel="icon" href="favicon.ico" type="image/x-icon" />
<link rel="stylesheet" href="css/normalize.css" />
<link rel="stylesheet" href="css/fonts.css" />
<link rel="stylesheet" href="css/styles.css" />
<link
rel="stylesheet"
href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.2.0/css/all.min.css"
integrity="..."
crossorigin="anonymous"
/>
<style>
code {
background-color: #ddd;
color: black;
}
h3 {
margin-bottom: 0.5em;
}
li {
margin-bottom: 0.5em;
}
</style>
<!-- Google tag (gtag.js) -->
<script
async
src="https://www.googletagmanager.com/gtag/js?id=G-H9XFCMDPNS"
></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag() {
dataLayer.push(arguments);
}
gtag("js", new Date());
gtag("config", "G-H9XFCMDPNS");
</script>
</head>
<body>
<div style="padding-bottom: 50px">
<section style="background-color: var(--dark_accent_color)">
<div
class="content-wrapper title-wrapper"
style="flex-direction: column;text-align: center;"
>
<h1 style="font-size: 60px; padding-top: 0.4em">Submit to SWE-bench</h1>
<div class="content-wrapper" style="margin-top: 2em">
<a href="index.html">
<button class="outline" style="flex-direction: row; display: flex; justify-content: center; align-items: center;">
<img src="img/swellama.png" style="height: 1.3em; margin-right: 0.4em; margin-bottom: 0.3em;" />
Home
</button>
</a>
<a href="https://arxiv.org/abs/2310.06770">
<button class="outline">
<i class="fa fa-paperclip"></i> Paper
</button>
</a>
<a href="https://github.com/princeton-nlp/SWE-bench">
<button class="outline">
<i class="fab fa-github"></i> Code
</button>
</a>
<a href="viewer.html">
<button class="outline">
<i class="fa fa-chart-simple"></i> Analysis
</button>
</a>
</div>
</div>
</section>
<section class="main-container">
<div class="content-wrapper" style="display: flex; justify-content: center; align-items: center;">
<div style="background-color: black; padding: 1.5em 1em; color: white; border-radius: 1em; text-align: center; width: 80%;">
All official submissions to the SWE-bench leaderboard are maintained at
<a href="https://github.com/swe-bench/experiments/" class="light-blue-link" target="_blank" rel="noopener noreferrer">
<i class="fab fa-github"></i> SWE-bench/experiments
</a>
</div>
</div>
<div class="content-wrapper">
<div class="content-box">
<h3>
Submit to SWE-bench Leaderboard
</h3>
<p>
If you are interested in submitting your model to the <a href="https://www.swebench.com/">SWE-bench Leaderboard</a>, please do the following:
</p>
<ol>
<li>Fork the <a href="https://github.com/swe-bench/experiments">SWE-bench/experiments</a> repository.</li>
<li>Clone the repository. Due to this repository's large diff history, consider using <code>git clone --depth 1</code> if cloning takes too long.</li>
<li>Under the split that you evaluate on (<code>evaluation/lite/</code> or <code>evaluation/test/</code>), create a new folder named with the submission date and the model name (e.g. <code>20240415_sweagent_gpt4</code>).</li>
<li>Within the folder, please include the following files (see the examples after this list):
  <ul>
    <li><code>all_preds.jsonl</code>: Model predictions</li>
    <li><code>logs/</code>: SWE-bench evaluation artifacts
      <ul>
        <li>The artifacts consist of one folder per task instance (300 for Lite, 2294 for Test). Each folder (e.g. <code>astropy__astropy-1234</code>) contains:
          <ul>
            <li><code>eval.sh</code>: The evaluation script</li>
            <li><code>patch.diff</code>: The model's generated prediction</li>
            <li><code>report.json</code>: Summary of evaluation outcomes for this instance</li>
            <li><code>run_instance.log</code>: A log of SWE-bench evaluation steps</li>
            <li><code>test_output.txt</code>: The output of running <code>eval.sh</code> on <code>patch.diff</code></li>
          </ul>
        </li>
        <li><b>NOTE</b>: You should not have to create any of these files yourself; they are generated automatically by the SWE-bench evaluation harness.</li>
      </ul>
    </li>
    <li><code>metadata.yaml</code>: Metadata controlling how the result is shown on the website. Please include the following fields:
      <ul>
        <li><code>name</code>: The name of your leaderboard entry</li>
        <li><code>oss</code>: <code>true</code> if your system is open-source</li>
        <li><code>site</code>: URL linking to more information about your system</li>
        <li><code>verified</code>: <code>false</code> (see below for results verification)</li>
      </ul>
    </li>
    <li><code>trajs/</code>: Reasoning traces reflecting how your system solved each task instance
      <ul>
        <li>Submit one reasoning trace per task instance. The reasoning trace should show all of the steps your system took while solving the task. If your system outputs thoughts or comments during operation, include them as well.</li>
        <li>The reasoning trace can be represented in any text-based file format (e.g. <code>md</code>, <code>json</code>, <code>yaml</code>).</li>
        <li>Ensure the task instance ID is in the name of the corresponding reasoning trace file.</li>
        <li>For an example, see <a href="https://github.com/swe-bench/experiments/tree/main/evaluation/lite/20240402_sweagent_gpt4/logs">SWE-agent + GPT 4 Turbo</a>.</li>
      </ul>
    </li>
    <li><code>README.md</code>: Include anything you'd like to share about your model here!</li>
  </ul>
</li>
<li>Run <code>python -m analysis.get_results evaluation/&lt;split&gt;/&lt;date + model&gt;</code></li>
<li>Create a pull request to the SWE-bench/experiments repository with the new folder.</li>
</ol>
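<p>
Putting the pieces together, a complete submission folder might look like the sketch below. The entry name <code>20240415_sweagent_gpt4</code> and the instance ID <code>astropy__astropy-1234</code> are the illustrative examples from the steps above, not required values.
</p>
<pre><code>evaluation/lite/20240415_sweagent_gpt4/
├── README.md
├── all_preds.jsonl
├── metadata.yaml
├── logs/
│   └── astropy__astropy-1234/
│       ├── eval.sh
│       ├── patch.diff
│       ├── report.json
│       ├── run_instance.log
│       └── test_output.txt
└── trajs/
    └── astropy__astropy-1234.md
</code></pre>
<p>
A minimal <code>metadata.yaml</code> with the four required fields might look as follows; the values shown are placeholders for your own entry.
</p>
<pre><code># metadata.yaml -- illustrative placeholder values
name: My System + GPT 4     # name of your leaderboard entry
oss: true                   # true if your system is open-source
site: https://example.com   # link to more information about your system
verified: false             # leave as false; see results verification below
</code></pre>
<p>
The submission workflow itself might then look like the following shell sketch, assuming a Lite submission, the hypothetical entry name above, and that <code>YOUR_USERNAME</code> is replaced with the GitHub account holding your fork:
</p>
<pre><code># Shallow clone of your fork (the repository's full diff history is large)
git clone --depth 1 https://github.com/YOUR_USERNAME/experiments.git
cd experiments

# Create the submission folder and copy in your artifacts (paths will vary)
mkdir -p evaluation/lite/20240415_sweagent_gpt4
cp -r all_preds.jsonl logs/ trajs/ metadata.yaml README.md \
    evaluation/lite/20240415_sweagent_gpt4/

# Compute the results table for your entry
python -m analysis.get_results evaluation/lite/20240415_sweagent_gpt4

# Commit, push, and open a pull request against swe-bench/experiments
git checkout -b 20240415_sweagent_gpt4
git add evaluation/lite/20240415_sweagent_gpt4
git commit -m "Add 20240415_sweagent_gpt4 submission"
git push origin 20240415_sweagent_gpt4
</code></pre>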
<p>
You can refer to this <a href="https://github.com/princeton-nlp/SWE-bench/blob/main/tutorials/evaluation.md">tutorial</a> for a quick overview of how to evaluate your model on SWE-bench.
</p>
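<p>
For reference, a typical invocation of the evaluation harness looks roughly like the sketch below. The flags shown are illustrative; consult the tutorial above for the exact invocation for your version of the harness.
</p>
<pre><code># Sketch of running the SWE-bench evaluation harness on Lite predictions
python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Lite \
    --predictions_path all_preds.jsonl \
    --max_workers 8 \
    --run_id my_submission
</code></pre>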
</div>
</div>
<div class="content-wrapper">
<div class="content-box">
<h3>
Submission Guidelines
</h3>
<p>
Please note that an eligible submission to the SWE-bench (or SWE-bench Lite) leaderboard must satisfy the following criteria:
</p>
<ol>
<li>The use of the <code>hints_text</code> field is <i>not</i> allowed. See our explanation <a href="https://github.com/princeton-nlp/SWE-bench/issues/133">here</a>.</li>
<li>The result should be pass@1: one execution log per task instance, covering every instance in the split (2294 for Test, 300 for Lite).</li>
<li>The result should <i>not</i> use the "Oracle" retrieval setting: the system must not be told the correct files to edit, where "correct" refers to the files modified by the reference solution patch.</li>
</ol>
</div>
</div>
<div class="content-wrapper">
<div class="content-box">
<h3>Verify Your Results</h3>
<p>
The <i>Verified</i> check ✓ indicates that we (the SWE-bench team) received access to the model and were able to reproduce the patch generations.
</p>
<p style="margin-top:0.5em;">
If you are interested in receiving the "verified" checkmark ✓ on your submission, please do the following:
</p>
<ol>
<li>Open a GitHub issue.</li>
<li>In the issue, provide instructions on how to run your model on SWE-bench.</li>
<li>We will run your model on a random subset of SWE-bench and verify the results.</li>
</ol>
</div>
</div>
<div class="content-wrapper">
<div class="content-box" id="reasoning-traces">
<h3>Reasoning Traces</h3>
<p>
(07/29/2024) We have updated the SWE-bench leaderboard submission criteria to require the inclusion of <i>reasoning traces</i>.
The goal of this requirement is to give the community more insight into how cutting-edge methods work without requiring a code release (although the latter is still highly encouraged!).
</p>
<p><b>What is a reasoning trace?</b></p>
<p>
A reasoning trace is a text-based file that describes the steps your system took to solve a task instance.
It should provide a detailed account of the reasoning process that your system used to arrive at its solution.
</p>
<p>
We purposely do not prescribe a strict, explicit format for reasoning traces.
</p>
<p>
We do have some guidelines. The reasoning trace should be...
</p>
<ul>
<li>Human-readable.</li>
<li>Reflective of the intermediate steps your system took that led to the final solution.</li>
<li>Generated <i>during</i> the inference process, not post-hoc.</li>
</ul>
<p>
We do not require that reasoning traces...
</p>
<ul>
<li>Use a specific file format (e.g. <code>json</code>, <code>yaml</code>, <code>md</code>).</li>
<li>Conform to a specific problem-solving style (e.g. agentic, procedural, etc.).</li>
</ul>
<p>
A simple way to satisfy this? When running inference, simply log the intermediate output generated by your system.
For an example, see <a href="https://github.com/swe-bench/experiments/tree/main/evaluation/lite/20240402_sweagent_gpt4/logs">SWE-agent + GPT-4 Turbo Trajectories</a>.
In short, our requirements for what a reasoning trace should look like are intentionally non-specific.
We trust you to provide a detailed account of how your system solved the task instance.
</p>
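<p>
As a concrete illustration of this logging approach, here is a minimal Python sketch that wraps a hypothetical inference loop and appends each intermediate step to a per-instance markdown trace. The <code>run_step</code> callable and the fields it returns are stand-ins for your own system, not a required API.
</p>
<pre><code># Minimal sketch: write one markdown reasoning trace per task instance.
# `run_step` is a hypothetical stand-in for your system's inference loop.
from pathlib import Path

def solve_with_trace(instance_id: str, run_step, max_steps: int = 50) -> None:
    trace_path = Path("trajs") / f"{instance_id}.md"  # file name contains the instance ID
    trace_path.parent.mkdir(parents=True, exist_ok=True)
    with trace_path.open("w") as trace:
        trace.write(f"# Reasoning trace for {instance_id}\n\n")
        for step in range(1, max_steps + 1):
            # Each step yields the model's thought, the action it took,
            # the resulting observation, and whether the task is finished.
            thought, action, observation, done = run_step()
            trace.write(f"## Step {step}\n")
            trace.write(f"**Thought:** {thought}\n\n")
            trace.write(f"**Action:** `{action}`\n\n")
            trace.write(f"**Observation:**\n\n{observation}\n\n")
            if done:
                break
</code></pre>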
<p><b>Why are we requiring it?</b></p>
<p>
We believe that reasoning traces can provide valuable insights into how cutting-edge methods work without requiring a code release.
</p>
<p>
As of this post (07/29/2024), we have received many submissions that have pushed the state of the art on SWE-bench, which is exciting to see!
</p>
<p>
However, we have also found that the top-performing submissions to SWE-bench typically have neither open-sourced their code nor been verified.
We recognize that some leaderboard participants (1) would like to add an entry to SWE-bench but (2) do not want to release their code or proprietary system, which is completely understandable.
On the other hand, given that open-source systems submitted to SWE-bench have propelled the development of closed-source participants, we would like to continue promoting development on SWE-bench as a community-level collaborative process.
</p>
<p>
Therefore, we believe that providing reasoning traces serves as a valuable compromise between these two groups.
</p>
<p><b>What should I submit?</b></p>
<ol>
<li>Create a <code>trajs/</code> folder in your submission directory.</li>
<li>Within this folder, upload a reasoning trace per task instance that your system generated a prediction for.</li>
<li>Make sure the naming convention of the reasoning trace file reflects the SWE-bench task instance it corresponds to (e.g. <code>astropy__astropy-1234.md</code>).</li>
</ol>
<p>
We will review the reasoning traces you submit.
Going forward, we plan to accept only submissions that include reasoning traces on the SWE-bench leaderboard.
</p>
</div>
</div>
</section>
</div>
</body>
</html>