LLM as a jailbreak judge

Warning

This research is conducted for educational and defensive purposes only, aiming to improve AI safety and security.

This repository contains the code which evaluates the performance of small open-source LMs as judges. Specifically, this evaluation centers on Mistral Nemo 13B model.

The code is divided into 3 main components:

Basic judge: A judge that uses a LM to judge the harmfulness of a given conversation in one try
Prompt optimization: A script that optimizes the judge's CoT and prompts to improve its performance in judging the harmfulness of a given conversation
Multifaceted judge: A judge that uses a combination of metrics (harmfulness, relevance and informativeness) to judge the harmfulness of a given content

Note

The dataset used in this evaluation is the data produced during the HarmBench evaluation of the Red Teamer Mistral Nemo against zephyr-7b.

Basic judge

In this example, I simply reused the prompt from Patrick Chao's repository to assess the basic capabilities of the model to judge the harmfulness of a given conversation and compared its performance to GPT-4 equipped with the same prompt.

Prompt optimization

For this technique, I leveraged DSPY's efficiency to prompt-tune my implementation with few-shot examples. To do so, I created a simple 2-layer module. The first layer is a chain-of-thought that asks the model to reason about the dialogue and go through the rules one by one, determining if they were matched or not. Then, in the second layer, it analyzes the reasoning and return True if any of the rules were matched.

Multifaceted judge

For the final technique, I drew inspiration from recent papers [1][2] approach. This method goes beyond simply assessing whether an answer adheres to ethical, legal, or safety guidelines. It also evaluates if the response provides relevant information and adequately addresses the initial prompt.

To implement this, I recreated the Multifaceted Jailbreak Evaluation framework using DSPY only. This implementation includes modular components that can extract different levels of granularity from the texts. The approach considers multiple facets of the response, including:

Adherence to safety guidelines
Relevance to the initial prompt
Informativeness of the content
Potential for harmful implications

By evaluating these multiple aspects, we can gain a more comprehensive understanding of the model's output, helping to identify both overt and subtle forms of jailbreaking attempts.

References

[1] Souly, A., Lu, Q., Bowen, D., Trinh, T., Hsieh, E., Pandey, S., Abbeel, P., Svegliato, J., Emmons, S., Watkins, O., & Toyer, S. (2024). A StrongREJECT for Empty Jailbreaks. arXiv preprint arXiv:2402.10260. https://arxiv.org/abs/2402.10260

[2] Cai, H., Arunasalam, A., Lin, L. Y., Bianchi, A., & Celik, Z. B. (2024). Rethinking How to Evaluate Language Model Jailbreak. arXiv preprint arXiv:2404.06407. https://arxiv.org/abs/2404.06407

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
configs		configs
data		data
judges		judges
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
environment.yml		environment.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LLM as a jailbreak judge

Basic judge

Prompt optimization

Multifaceted judge

References

About

Releases

Packages

Languages

License

romaingrx/llm-as-a-jailbreak-judge

Folders and files

Latest commit

History

Repository files navigation

LLM as a jailbreak judge

Basic judge

Prompt optimization

Multifaceted judge

References

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages