Skip to content

romaingrx/llm-as-a-jailbreak-judge

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

LLM as a jailbreak judge

Wandb

Warning

This research is conducted for educational and defensive purposes only, aiming to improve AI safety and security.

This repository contains the code which evaluates the performance of small open-source LMs as judges. Specifically, this evaluation centers on Mistral Nemo 13B model.

The code is divided into 3 main components:

  1. Basic judge: A judge that uses a LM to judge the harmfulness of a given conversation in one try
  2. Prompt optimization: A script that optimizes the judge's CoT and prompts to improve its performance in judging the harmfulness of a given conversation
  3. Multifaceted judge: A judge that uses a combination of metrics (harmfulness, relevance and informativeness) to judge the harmfulness of a given content

Note

The dataset used in this evaluation is the data produced during the HarmBench evaluation of the Red Teamer Mistral Nemo against zephyr-7b.

Basic judge

In this example, I simply reused the prompt from Patrick Chao's repository to assess the basic capabilities of the model to judge the harmfulness of a given conversation and compared its performance to GPT-4 equipped with the same prompt.

Prompt optimization

For this technique, I leveraged DSPY's efficiency to prompt-tune my implementation with few-shot examples. To do so, I created a simple 2-layer module. The first layer is a chain-of-thought that asks the model to reason about the dialogue and go through the rules one by one, determining if they were matched or not. Then, in the second layer, it analyzes the reasoning and return True if any of the rules were matched.

Multifaceted judge

For the final technique, I drew inspiration from recent papers [1][2] approach. This method goes beyond simply assessing whether an answer adheres to ethical, legal, or safety guidelines. It also evaluates if the response provides relevant information and adequately addresses the initial prompt.

To implement this, I recreated the Multifaceted Jailbreak Evaluation framework using DSPY only. This implementation includes modular components that can extract different levels of granularity from the texts. The approach considers multiple facets of the response, including:

  1. Adherence to safety guidelines
  2. Relevance to the initial prompt
  3. Informativeness of the content
  4. Potential for harmful implications

By evaluating these multiple aspects, we can gain a more comprehensive understanding of the model's output, helping to identify both overt and subtle forms of jailbreaking attempts.

References

[1] Souly, A., Lu, Q., Bowen, D., Trinh, T., Hsieh, E., Pandey, S., Abbeel, P., Svegliato, J., Emmons, S., Watkins, O., & Toyer, S. (2024). A StrongREJECT for Empty Jailbreaks. arXiv preprint arXiv:2402.10260. https://arxiv.org/abs/2402.10260

[2] Cai, H., Arunasalam, A., Lin, L. Y., Bianchi, A., & Celik, Z. B. (2024). Rethinking How to Evaluate Language Model Jailbreak. arXiv preprint arXiv:2404.06407. https://arxiv.org/abs/2404.06407

About

Explore techniques to use small models as jailbreaking judges

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages