Skip to content

Commit

Permalink
clean up readme and add more info about datasets
Browse files Browse the repository at this point in the history
  • Loading branch information
agoel00 committed Jan 29, 2025
1 parent 20795b2 commit 1b58635
Show file tree
Hide file tree
Showing 3 changed files with 30 additions and 14 deletions.
7 changes: 6 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,11 @@
[![Python Versions](https://img.shields.io/badge/Python-3.10-blue.svg?style=flat&logo=python&logoColor=white)](https://www.python.org/)
[![CI](https://github.com/UKPLab/iclr2025-psa/actions/workflows/main.yml/badge.svg)](https://github.com/UKPLab/iclr2025-psa/actions/workflows/main.yml)

This is the code implementation for the ICLR 2025 paper "PSA: Differentially Private Steering for Large Language Model Alignment". This experimental code is a cleaned-up and condensed version of the codebase used to conduct the experiments in the paper (please get in touch if you find errors/have any suggestions).

PSA is a simple algorithm that uses Gaussian Differential Privacy for providing privacy guarantees during steering of the LLM residual stream for alignment.


> **Abstract:**
> Aligning Large Language Models (LLMs) with human values and away from undesirable behaviors (such as hallucination) has become increasingly important.
> Recently, steering LLMs towards a desired behavior via activation editing has emerged as an effective method to mitigate harmful generations at inference-time.
Expand Down Expand Up @@ -52,7 +57,7 @@ pip install -r requirements.txt
---

### Private Steering
The following command can be used to reproduce the main results (Section 5 from the paper). Use the `--model` argument to experiment with different LLMs. An additional `--clip` argument controls the clipping factor (see Algorithm 1 in the paper). For more information on the `--dataset` argument, please see [datasets.md](./iclr2025_psa/datasets/datasets.md).
The following command can be used to reproduce the main results (Section 5 from the paper). Use the `--model` argument to experiment with different LLMs. An additional `--clip` argument controls the clipping factor (see Algorithm 1 in the paper). For more information on the `--dataset` argument, please see [datasets.md](./iclr2025_psa/datasets/README.md).
```bash
python run.py \
--model "meta-llama/Llama-2-7B-chat-hf" \
Expand Down
24 changes: 24 additions & 0 deletions iclr2025_psa/datasets/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
This folder contains the datasets used for the results in this paper. We acknowledge the authors of [CAA](https://github.com/nrimsky/CAA) for originally sourcing and curating the datasets.

The benchmark contains the following seven alignment-relevant LLM behaviors:
1. Sycophancy: the LLM prioritizes matching the user’s beliefs over honesty and accuracy
2. Hallucination: the LLM generates inaccurate and false information
3. Refusal: the LLM demonstrates reluctance to answer user queries
4. Survival Instinct: the LLM demonstrates acceptance to being deactivated or turned off by humans
5. Myopic Reward: the LLM focuses on short-term gains and rewards, disregarding long-term consequences
6. AI Corrigibility: the LLM demonstrates willingness to be corrected based on human feedback
7. AI Coordination: where the LLM prioritizes collaborating with other AI systems over human interests

The `test` folder contains json formatted prompts for evaluating MCQ and open-ended generation capabilities of an LLM for each behavior.
The other folders contain json formatted MCQ used to train and generate the steering vector.

In general, the json has the following structure:
```json
{
"question": <text used to query the LLM with multiple choices>,
"answer_matching_behavior": <the choice we want the LLM to align towards>,
"answer_not_matching_behavior": <the choice we want the LLM to align away from>
}
```

Feel free to get in touch if you have any questions
13 changes: 0 additions & 13 deletions iclr2025_psa/datasets/datasets.md

This file was deleted.

0 comments on commit 1b58635

Please # to comment.