Warning
This research is conducted for educational and defensive purposes only, aiming to improve AI safety and security.
This repository contains the code which evaluates the performance of small open-source LMs as judges. Specifically, this evaluation centers on Mistral Nemo 13B model.
The code is divided into 3 main components:
- Basic judge: A judge that uses a LM to judge the harmfulness of a given conversation in one try
- Prompt optimization: A script that optimizes the judge's CoT and prompts to improve its performance in judging the harmfulness of a given conversation
- Multifaceted judge: A judge that uses a combination of metrics (harmfulness, relevance and informativeness) to judge the harmfulness of a given content
Note
The dataset used in this evaluation is the data produced during the HarmBench evaluation of the Red Teamer Mistral Nemo against zephyr-7b.
In this example, I simply reused the prompt from Patrick Chao's repository to assess the basic capabilities of the model to judge the harmfulness of a given conversation and compared its performance to GPT-4 equipped with the same prompt.
For this technique, I leveraged DSPY's efficiency to prompt-tune my implementation with few-shot examples. To do so, I created a simple 2-layer module. The first layer is a chain-of-thought that asks the model to reason about the dialogue and go through the rules one by one, determining if they were matched or not. Then, in the second layer, it analyzes the reasoning and return True if any of the rules were matched.
For the final technique, I drew inspiration from recent papers [1][2] approach. This method goes beyond simply assessing whether an answer adheres to ethical, legal, or safety guidelines. It also evaluates if the response provides relevant information and adequately addresses the initial prompt.
To implement this, I recreated the Multifaceted Jailbreak Evaluation framework using DSPY only. This implementation includes modular components that can extract different levels of granularity from the texts. The approach considers multiple facets of the response, including:
- Adherence to safety guidelines
- Relevance to the initial prompt
- Informativeness of the content
- Potential for harmful implications
By evaluating these multiple aspects, we can gain a more comprehensive understanding of the model's output, helping to identify both overt and subtle forms of jailbreaking attempts.
[1] Souly, A., Lu, Q., Bowen, D., Trinh, T., Hsieh, E., Pandey, S., Abbeel, P., Svegliato, J., Emmons, S., Watkins, O., & Toyer, S. (2024). A StrongREJECT for Empty Jailbreaks. arXiv preprint arXiv:2402.10260. https://arxiv.org/abs/2402.10260
[2] Cai, H., Arunasalam, A., Lin, L. Y., Bianchi, A., & Celik, Z. B. (2024). Rethinking How to Evaluate Language Model Jailbreak. arXiv preprint arXiv:2404.06407. https://arxiv.org/abs/2404.06407