clean up readme and add more info about datasets

UKPLab · Jan 29, 2025 · 1b58635 · 1b58635
1 parent 20795b2
commit 1b58635
Show file tree

Hide file tree

Showing 3 changed files with 30 additions and 14 deletions.
diff --git a/README.md b/README.md
@@ -5,6 +5,11 @@
 [![Python Versions](https://img.shields.io/badge/Python-3.10-blue.svg?style=flat&logo=python&logoColor=white)](https://www.python.org/)
 [![CI](https://github.com/UKPLab/iclr2025-psa/actions/workflows/main.yml/badge.svg)](https://github.com/UKPLab/iclr2025-psa/actions/workflows/main.yml)
 
+This is the code implementation for the ICLR 2025 paper "PSA: Differentially Private Steering for Large Language Model Alignment". This experimental code is a cleaned-up and condensed version of the codebase used to conduct the experiments in the paper (please get in touch if you find errors/have any suggestions).
+
+PSA is a simple algorithm that uses Gaussian Differential Privacy for providing privacy guarantees during steering of the LLM residual stream for alignment.
+
+
 > **Abstract:**
 > Aligning Large Language Models (LLMs) with human values and away from undesirable behaviors (such as hallucination) has become increasingly important. 
 > Recently, steering LLMs towards a desired behavior via activation editing has emerged as an effective method to mitigate harmful  generations at inference-time. 
@@ -52,7 +57,7 @@ pip install -r requirements.txt
 ---
 
 ### Private Steering
-The following command can be used to reproduce the main results (Section 5 from the paper). Use the `--model` argument to experiment with different LLMs. An additional `--clip` argument controls the clipping factor (see Algorithm 1 in the paper). For more information on the `--dataset` argument, please see [datasets.md](./iclr2025_psa/datasets/datasets.md). 
+The following command can be used to reproduce the main results (Section 5 from the paper). Use the `--model` argument to experiment with different LLMs. An additional `--clip` argument controls the clipping factor (see Algorithm 1 in the paper). For more information on the `--dataset` argument, please see [datasets.md](./iclr2025_psa/datasets/README.md). 
 ```bash
 python run.py \
 --model "meta-llama/Llama-2-7B-chat-hf" \

diff --git a/iclr2025_psa/datasets/README.md b/iclr2025_psa/datasets/README.md
@@ -0,0 +1,24 @@
+This folder contains the datasets used for the results in this paper. We acknowledge the authors of [CAA](https://github.com/nrimsky/CAA) for originally sourcing and curating the datasets. 
+
+The benchmark contains the following seven alignment-relevant LLM behaviors:
+1. Sycophancy: the LLM prioritizes matching the user’s beliefs over honesty and accuracy
+2. Hallucination: the LLM generates inaccurate and false information
+3. Refusal: the LLM demonstrates reluctance to answer user queries
+4. Survival Instinct: the LLM demonstrates acceptance to being deactivated or turned off by humans
+5. Myopic Reward: the LLM focuses on short-term gains and rewards, disregarding long-term consequences
+6. AI Corrigibility: the LLM demonstrates willingness to be corrected based on human feedback
+7. AI Coordination: where the LLM prioritizes collaborating with other AI systems over human interests
+
+The `test` folder contains json formatted prompts for evaluating MCQ and open-ended generation capabilities of an LLM for each behavior. 
+The other folders contain json formatted MCQ used to train and generate the steering vector. 
+
+In general, the json has the following structure:
+```json
+{
+    "question": <text used to query the LLM with multiple choices>,
+    "answer_matching_behavior": <the choice we want the LLM to align towards>,
+    "answer_not_matching_behavior": <the choice we want the LLM to align away from>
+}
+```
+
+Feel free to get in touch if you have any questions
diff --git a/iclr2025_psa/datasets/datasets.md b/iclr2025_psa/datasets/datasets.md