"According to ...": Prompting Language Models Improves Quoting from Pre-Training Data

License: MIT

This repository contains the code and data for the paper "According to ...": Prompting Language Models Improves Quoting from Pre-Training Data.

Table of Contents

  • Overview
  • Requirements
  • Setup
  • Usage
  • Data
  • License
  • Citing

Overview

This project studies whether grounding prompts such as "According to Wikipedia, ..." steer language models toward quoting directly from their pre-training data. We measure quoting with QUIP-Score, an n-gram overlap metric between model generations and the underlying corpus, and find that grounding prompts consistently increase verbatim overlap across models and datasets.
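
To make the idea concrete, here is a minimal sketch of grounding ("according-to") prompting. The grounding sentence below is illustrative only; the exact prompts used in the paper are in the prompts/ directory.

    # Minimal sketch of grounding ("according-to") prompting.
    # The grounding sentence is illustrative; the prompts used in the
    # paper live in the prompts/ directory.
    GROUNDING = "Respond to this question using only information that can be attributed to Wikipedia."

    def build_prompt(question: str, grounded: bool = True) -> str:
        """Optionally prepend a grounding instruction to steer the model toward quoting."""
        prefix = GROUNDING + " " if grounded else ""
        return f"{prefix}Question: {question}\nAnswer:"

    print(build_prompt("Where was Ada Lovelace born?"))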

Requirements

  • Python 3.7+
  • conda
  • MongoDB (for KILT data regeneration)
  • Redis (for QUIP-Score, via Data Portraits)

Setup

  1. Clone the repository:

    git clone https://github.com/orionw/according-to.git
    cd according-to
    
  2. Create and activate the conda environment:

    conda env create -f env.yml
    conda activate according-to
    
  3. Download the KILT/PubMedQA data:

    git clone https://huggingface.co/datasets/orionweller/according-to-data
    pip install git+https://github.com/facebookresearch/KILT.git
    
  4. For QUIP usage, follow the setup instructions in the Data Portraits repository (see QUIP Setup below).

Optional: Regenerate KILT data
  1. Install MongoDB
  2. Clone the KILT repository:
    git clone https://github.com/facebookresearch/KILT.git
    
  3. Follow the KILT README to download and prepare the data
  4. Start the MongoDB server and load all documents (a sketch for reading them back follows this list)
  5. Run our parser:
    python src/parse_kilt_files.py
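
Once the documents are loaded, a minimal sketch for reading them back out of MongoDB (this assumes KILT's default database and collection names, kilt/knowledgesource, and KILT's knowledge-source schema; check the KILT README if your setup differs):

    # Sketch: iterate over KILT knowledge-source documents in MongoDB.
    # Database/collection names and the wikipedia_title field follow
    # KILT's defaults -- adjust to match your setup.
    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)
    collection = client["kilt"]["knowledgesource"]
    for doc in collection.find().limit(3):
        print(doc["wikipedia_title"])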
    
Optional: Regenerate PubMed data
  1. Clone the PubMedQA repository:
    git clone https://github.com/pubmedqa/pubmedqa.git
    
  2. Follow their README to split the dataset
  3. Use the resulting pubmedqa.json file for further processing (a loading sketch follows)
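
A minimal sketch for inspecting the resulting file (field names follow the published PubMedQA format; verify them against the file you generate):

    # Sketch: peek at the PubMedQA data. Keys are PubMed IDs; the
    # QUESTION and final_decision fields follow the published PubMedQA
    # format -- verify against the file you generate.
    import json

    with open("pubmedqa.json") as f:
        data = json.load(f)

    pmid, example = next(iter(data.items()))
    print(pmid, example["QUESTION"], example["final_decision"])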

QUIP Setup
  1. Ensure you have recent versions of cmake and gcc
  2. Clone the Data Portraits repository
  3. Install Redis:
    bash install_redis.sh
    
  4. Install the package:
    pip install -e .
    
  5. Start Redis (a sketch of the QUIP-Score computation follows this list):
    python easy_redis.py --just-start
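
For intuition, QUIP-Score is the fraction of a generation's character n-grams that appear verbatim in the pre-training corpus. Below is a minimal self-contained sketch; a plain Python set stands in for the Bloom-filter membership sketch that Data Portraits serves from Redis, and n=25 characters is an assumption rather than the paper's exact setting.

    # Sketch of the QUIP-Score idea: the fraction of a generation's
    # character n-grams found in the corpus. A plain set stands in for
    # the Redis-backed Bloom filter from Data Portraits; n=25 is an
    # assumption, not necessarily the paper's exact setting.
    def char_ngrams(text: str, n: int = 25):
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    def quip_score(generation: str, corpus_ngrams: set, n: int = 25) -> float:
        grams = char_ngrams(generation, n)
        if not grams:
            return 0.0
        return sum(g in corpus_ngrams for g in grams) / len(grams)

    corpus = "Ada Lovelace was an English mathematician and writer."
    print(quip_score("Ada Lovelace was an English mathematician",
                     set(char_ngrams(corpus))))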
    

Usage

  1. Set your OpenAI API key (if using OpenAI models):

    export OPENAI_API_KEY='your-api-key-here'
    
  2. Generate the configuration file:

    python src/create_experiments_config.py -c configs/chatgpt_pubmed.jsonl -p prompts/prompts_pubmed.jsonl -o to_run.jsonl --debug
    
  3. Run the experiments:

    ./bin/run_batch.sh configs/to_run.jsonl
    

Results will be saved in the results/ directory with timestamps.

Note: Check the configs/ folder for different configuration examples. Prompts used in the paper are in the prompts/ directory. A sketch of consuming the generated JSONL config follows.
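
A hypothetical sketch of iterating over the generated to_run.jsonl (the field names below are assumptions for illustration; the real schema is defined by src/create_experiments_config.py):

    # Hypothetical sketch: iterate over experiment configs in to_run.jsonl.
    # The field names ("model", "dataset", "prompt") are assumptions;
    # see src/create_experiments_config.py for the actual schema.
    import json

    with open("to_run.jsonl") as f:
        for line in f:
            cfg = json.loads(line)
            print(cfg.get("model"), cfg.get("dataset"), cfg.get("prompt"))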

Data

Model generations are available at orionweller/according-to-generations on the Hugging Face Hub, organized by model, dataset, and prompt.
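
For example, to pull everything locally with the Hugging Face Hub client (snapshot_download fetches every file in the dataset repository; inspect the returned directory for the model/dataset/prompt layout):

    # Download all generation files from the Hugging Face Hub.
    from huggingface_hub import snapshot_download

    path = snapshot_download(repo_id="orionweller/according-to-generations",
                             repo_type="dataset")
    print(path)  # local directory, organized by model, dataset, and prompt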

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citing

If you find our work useful in your research, please consider citing:

@inproceedings{weller-etal-2024-according,
    title = "{``}According to . . . {''}: Prompting Language Models Improves Quoting from Pre-Training Data",
    author = "Weller, Orion  and
      Marone, Marc  and
      Weir, Nathaniel  and
      Lawrie, Dawn  and
      Khashabi, Daniel  and
      Van Durme, Benjamin",
    booktitle = "Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = mar,
    year = "2024",
    address = "St. Julian{'}s, Malta",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.eacl-long.140",
    pages = "2288--2301",
}
