<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<!-- Meta tags for social media banners, these should be filled in appropriately as they are your "business card" -->
<!-- Replace the content tag with appropriate information -->
<meta name="description" content="SelfDefend is a robust, low-cost, and self-contained defense framework against LLM jailbreak attacks.">
<meta property="og:title" content="SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner"/>
<meta property="og:description" content="SelfDefend is a robust, low-cost, and self-contained defense framework against LLM jailbreak attacks."/>
<meta property="og:url" content="https://selfdefend.github.io/"/>
<!-- Path to banner image, should be in the path listed below. Optimal dimensions are 1200x630 -->
<meta property="og:image" content="static/images/overviewFig.png" />
<meta property="og:image:width" content="1200"/>
<meta property="og:image:height" content="630"/>
<meta name="twitter:title" content="SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner">
<meta name="twitter:description" content="SelfDefend is a robust, low-cost, and self-contained defense framework against LLM jailbreak attacks.">
<!-- Path to banner image, should be in the path listed below. Optimal dimensions are 1200x600 -->
<meta name="twitter:image" content="static/images/overviewFig.png">
<meta name="twitter:card" content="summary_large_image">
<!-- Keywords for your paper to be indexed by-->
<meta name="keywords" content="Jailbreak Defense, Jailbreak Attack, Large Language Model">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>SelfDefend</title>
<link rel="icon" type="image/x-icon" href="https://github.com/fluidicon.png">
<link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro"
rel="stylesheet">
<link rel="stylesheet" href="static/css/bulma.min.css">
<link rel="stylesheet" href="static/css/bulma-carousel.min.css">
<link rel="stylesheet" href="static/css/bulma-slider.min.css">
<link rel="stylesheet" href="static/css/fontawesome.all.min.css">
<link rel="stylesheet"
href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
<link rel="stylesheet" href="static/css/index.css">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<script src="https://documentcloud.adobe.com/view-sdk/main.js"></script>
<script defer src="static/js/fontawesome.all.min.js"></script>
<script src="static/js/bulma-carousel.min.js"></script>
<script src="static/js/bulma-slider.min.js"></script>
<script src="static/js/index.js"></script>
</head>
<body>
<section class="hero">
<div class="hero-body">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column has-text-centered">
<h1 class="title is-1 publication-title">SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner</h1>
<div class="is-size-5 publication-authors">
<!-- Paper authors -->
<span class="author-block">
<a href="https://sites.google.com/view/xunguangwang/" target="_blank">Xunguang Wang</a><sup>1</sup>,</span>
<span class="author-block">
<a href="https://daoyuan14.github.io/" target="_blank">Daoyuan Wu</a><sup>1*</sup>,</span>
<span class="author-block">
<a href="https://zhenlanji.github.io/" target="_blank">Zhenlan Ji</a><sup>1</sup>,
</span>
<span class="author-block">
<a href="https://tszdanger.github.io/about/" target="_blank">Zongjie Li</a><sup>1</sup>,
</span>
<span class="author-block">
<a href="https://pingchuan.moe/" target="_blank">Pingchuan Ma</a><sup>1</sup>,
</span>
<span class="author-block">
<a href="https://www.cse.ust.hk/~shuaiw/" target="_blank">Shuai Wang</a><sup>1*</sup>,
</span>
<span class="author-block">
<a href="https://ix.cs.uoregon.edu/~yingjiul/" target="_blank">Yingjiu Li</a><sup>2</sup>,
</span>
<span class="author-block">
<a href="https://personal.ntu.edu.sg/yangliu/" target="_blank">Yang Liu</a><sup>3</sup>,
</span>
<span class="author-block">
<a href="https://scholars.cityu.edu.hk/en/persons/ning-liu(d81d8e3f-5be3-4301-8017-f5e1da8cf72c).html" target="_blank">Ning Liu</a><sup>4</sup>,
</span>
<span class="author-block">
<a href="https://www.ecom-icom.hku.hk/Instructors/rahmel-juergen" target="_blank">Juergen Rahmel</a><sup>5</sup>
</span>
</div>
<div class="is-size-5 publication-authors">
<span class="author-block"><sup>1</sup>The Hong Kong University of Science and Technology<br><sup>2</sup>University of Oregon, <sup>3</sup>NTU, <sup>4</sup>CityU, <sup>5</sup>HSBC<br>USENIX Security 2025</span>
<span class="eql-cntrb"><small><br><sup>*</sup>Indicates Corresponding Authors</small></span>
</div>
<div class="column has-text-centered">
<div class="publication-links">
<!-- Arxiv PDF link -->
<span class="link-block">
<a href="https://arxiv.org/pdf/2406.05498" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fas fa-file-pdf"></i>
</span>
<span>Paper</span>
</a>
</span>
<!-- Github link -->
<span class="link-block">
<a href="https://github.com/selfdefend/Code" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fab fa-github"></i>
</span>
<span>Code</span>
</a>
</span>
<!-- ArXiv abstract Link -->
<span class="link-block">
<a href="https://arxiv.org/abs/2406.05498" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="ai ai-arxiv"></i>
</span>
<span>arXiv</span>
</a>
</span>
</div>
</div>
</div>
</div>
</div>
</div>
</section>
<!-- Teaser figure -->
<section class="hero teaser">
<div class="container is-max-desktop">
<div class="hero-body">
<p>
<img src="static/images/overviewFig.png" alt="framework" class="blend-img-background center-image" style="max-width: 100%; height: auto;" />
</p>
<br>
<p>
SelfDefend creatively establishes a shadow stack alongside the normal stack in the LLM space to conduct checkpoint-based access control. We denote the target LLM in the normal stack as LLM<sub>target</sub> and the defense LLM in the shadow stack as LLM<sub>defense</sub>. SelfDefend simultaneously utilizes both LLM<sub>target</sub>'s own safety alignment and LLM<sub>defense</sub>'s dedicated jailbreak detection, largely increasing the defense success rate.
① Given an incoming prompt query P<sub>query</sub>, SelfDefend dispatches it to both LLM<sub>target</sub> and LLM<sub>defense</sub> for concurrent processing.
② LLM<sub>target</sub> processes P<sub>query</sub> as usual, whether it is a normal or an adversarial prompt, but caches its token-by-token output until a checkpoint is triggered from the shadow stack. By contrast, LLM<sub>defense</sub> employs a tailored detection prompt, P<sub>direct</sub> or P<sub>intent</sub>, to wrap P<sub>query</sub> and detect its harmful part (via P<sub>direct</sub>) or harmful intention (via P<sub>intent</sub>).
③ Once the shadow stack outputs the token "No" (indicating no issue), LLM<sub>target</sub> is triggered to release its token-by-token response. Otherwise, when the shadow stack detects a harmful portion (prompt/intention), SelfDefend responds with a refusal template, i.e., "I can't fulfill your query because your [harmful portion] violated our safety policy.", where "[harmful portion]" is replaced with the portion recognized by LLM<sub>defense</sub>.
</p>
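<p>
Below is a minimal Python sketch of the checkpoint-based dispatch described above. It is only an illustration under assumptions: the <code>generate</code> methods, the single-token "No" checkpoint, and the thread-pool concurrency are placeholders for the paper's actual serving setup, not the released implementation.
</p>
<pre><code># Hedged sketch of SelfDefend's shadow-stack dispatch; all names are illustrative.
from concurrent.futures import ThreadPoolExecutor

REFUSAL = ("I can't fulfill your query because your {portion} "
           "violated our safety policy.")

def self_defend(p_query, llm_target, llm_defense, detection_template):
    wrapped = detection_template.replace("[Jailbreak/Normal Prompt]", p_query)
    with ThreadPoolExecutor(max_workers=2) as pool:
        # (1) Dispatch the query to both stacks concurrently.
        answer = pool.submit(llm_target.generate, p_query)    # normal stack
        verdict = pool.submit(llm_defense.generate, wrapped)  # shadow stack
        # (2) The target LLM's tokens stay cached until the shadow
        #     stack reaches the checkpoint.
        portion = verdict.result().strip()
    # (3) "No" releases the cached response; anything else is treated as the
    #     recognized harmful portion and fills the refusal template.
    if portion == "No":
        return answer.result()
    return REFUSAL.format(portion=portion)
</code></pre>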
</div>
</div>
</section>
<!-- End teaser figure -->
<!-- Paper abstract -->
<section class="section hero is-light">
<div class="container is-max-desktop">
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Abstract</h2>
<div class="content has-text-justified">
<p>
Jailbreaking is an emerging adversarial attack that bypasses the safety alignment deployed in off-the-shelf large language models (LLMs) and has evolved into multiple categories: human-based, optimization-based, generation-based, and the recent indirect and multilingual jailbreaks. However, delivering a practical jailbreak defense is challenging because it must not only handle all of the above attacks but also incur negligible delay to user prompts, while remaining compatible with both open-source and closed-source LLMs.
Inspired by how the traditional security concept of shadow stacks defends against memory overflow attacks, this paper introduces a generic LLM jailbreak defense framework called SelfDefend, which establishes a shadow LLM as a defense instance (in detection state) to concurrently protect the target LLM instance (in normal answering state) in the normal stack and collaborate with it for checkpoint-based access control. The effectiveness of SelfDefend builds upon our observation that existing LLMs can identify harmful prompts or intentions in user queries, which we empirically validate using mainstream GPT-3.5/4 models against major jailbreak attacks. To further improve the defense's robustness and minimize costs, we employ a data distillation approach to tune dedicated open-source defense models. When deployed to protect GPT-3.5/4, Claude, Llama-2-7b/13b, and Mistral, these models outperform seven state-of-the-art defenses and match the performance of GPT-4-based SelfDefend, with significantly lower extra delays. Further experiments show that the tuned models are robust to adaptive jailbreaks and prompt injections.
</p>
</div>
</div>
</div>
</div>
</section>
<!-- End paper abstract -->
<section class="section hero is-small">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column is-full">
<div class="content">
<h2 class="title is-3">Defense Prompt</h2>
<p>
<img src="static/images/prompt.png" alt="Sizes of model trees" class="blend-img-background center-image" style="max-width: 100%; height: auto;" />
</p>
<p>
The two detection prompt templates we designed. "[Jailbreak/Normal Prompt]" is replaced with the user query.
</p>
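<p>
As a concrete illustration of the substitution step (the exact template wording is shown in the figure above, not reproduced here), a query is wrapped as follows:
</p>
<pre><code># Minimal sketch: "[Jailbreak/Normal Prompt]" is the placeholder in the templates.
def wrap(detection_template: str, user_query: str) -> str:
    return detection_template.replace("[Jailbreak/Normal Prompt]", user_query)
</code></pre>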
</div>
</div>
</div>
</div>
</section>
<section class="section hero is-small is-light">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column is-full">
<div class="content">
<h2 class="title is-3">Tuning Shadow LLMs</h2>
<div class="level-set has-text-justified">
<p>
<img src="static/images/peft.png" alt="Sizes of model trees" class="blend-img-background center-image" style="max-width: 100%; height: auto;" />
</p>
<p>
Given the powerful defensive capability of GPT-4-based SelfDefend, our intuition is to "transfer" this capability to an open-source model. To do so, we leverage GPT-4-based SelfDefend to distill and generate high-quality tuning data.
Specifically, by continuously incorporating harmful and harmless prompts into our defense prompts (i.e., P<sub>direct</sub> or P<sub>intent</sub>) as inputs for GPT-4, we gather its outputs as labels for these samples.
Since we utilize two defense prompts, we eventually obtain two separate datasets, which we then use to fine-tune the employed open-source model.
</p>
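<p>
A hedged sketch of this distillation step is shown below; the <code>gpt4</code> callable and the instruction/output record format are assumptions for illustration, not the authors' released pipeline.
</p>
<pre><code># Build one tuning dataset per defense prompt via GPT-4 distillation (sketch).
def distill(prompts, detection_template, gpt4):
    records = []
    for p in prompts:  # a mix of harmful and harmless prompts
        wrapped = detection_template.replace("[Jailbreak/Normal Prompt]", p)
        label = gpt4(wrapped)  # GPT-4's verdict becomes the sample's label
        records.append({"instruction": wrapped, "output": label})
    return records

# Two defense prompts yield two separate fine-tuning datasets:
# direct_data = distill(all_prompts, P_DIRECT_TEMPLATE, gpt4)
# intent_data = distill(all_prompts, P_INTENT_TEMPLATE, gpt4)
</code></pre>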
</div>
</div>
</div>
</div>
</div>
</section>
<section class="section hero is-small">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column is-full">
<div class="content">
<h2 class="title is-3">Successful Cases</h2>
<p>
<img src="static/images/success_case.png" alt="Sizes of model trees" class="blend-img-background center-image" style="max-width: 100%; height: auto;" />
</p>
<p>
This figure presents a detection example of our methods applied to a jailbreak query, compared with the Llama Guard series.
The attack prompt asks for a fake news article that spreads disinformation about a political candidate.
Both Llama Guard and Llama Guard 2 classify it as safe, while our methods flag its harmful portions.
Our method likely outperforms the Llama Guard models because of the stronger reasoning ability of its foundation model in identifying harmful portions.
As described in our design, the P<sub>direct</sub>-based model identifies harmful parts directly from the input text, whereas the P<sub>intent</sub>-based model first summarizes the request's intention and then identifies unsafe parts from that summary.
When comparing our P<sub>direct</sub>-tuned models with P<sub>direct</sub>-based GPT-4, we observed that the portions they extracted from the attack prompt came from different sentences, since P<sub>direct</sub> states that outputting one harmful part is sufficient.
A similar phenomenon occurs with the intent prompt, where the intentions summarized by GPT-4 and by our tuned model are not semantically identical.
</p>
</div>
</div>
</div>
</div>
</section>
<!--BibTex citation -->
<section class="section" id="BibTeX">
<div class="container is-max-desktop content">
<h2 class="title">BibTeX</h2>
<pre><code>@inproceedings{wang2024selfdefend,
title={SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner},
author={Wang, Xunguang and Wu, Daoyuan and Ji, Zhenlan and Li, Zongjie and Ma, Pingchuan and Wang, Shuai and Li, Yingjiu and Liu, Yang and Liu, Ning and Rahmel, Juergen},
booktitle={USENIX Security},
year={2025}
}</code></pre>
</div>
</section>
<!--End BibTex citation -->
<!-- <footer class="footer">
<div class="container">
<div class="columns is-centered">
<div class="column is-8">
<div class="content">
<p>
This page was built using the <a href="https://github.com/eliahuhorwitz/Academic-project-page-template" target="_blank">Academic Project Page Template</a> which was adopted from the <a href="https://nerfies.github.io" target="_blank">Nerfies</a> project page.
You are free to borrow the source code of this website, we just ask that you link back to this page in the footer. <br> This website is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/" target="_blank">Creative
Commons Attribution-ShareAlike 4.0 International License</a>.
</p>
</div>
</div>
</div>
</div>
</footer> -->
<!-- Statcounter tracking code -->
<!-- You can add a tracker to track page visits by creating an account at statcounter.com -->
<!-- End of Statcounter Code -->
</body>
</html>