This repository contains the code and data for the paper "According to ...": Prompting Language Models Improves Quoting from Pre-Training Data.
This project studies how prompting a language model to attribute its answer to a source (for example, "According to ...") affects how well it quotes from its pre-training data. The paper reports more accurate and reliable quotation across several tasks and datasets.
- Python 3.7+
- conda
- MongoDB (for KILT data regeneration)
- Redis (for QUIP)
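As a quick sanity check of the prerequisites above, a minimal sketch along these lines may help. It is not part of the repository: the helper names are hypothetical, and the executable names `mongod` and `redis-server` are assumptions about how MongoDB and Redis are installed on your machine (Redis installed through the Data Portraits script may not be on your PATH at all).

```python
import shutil
import sys

# Minimum interpreter version taken from the prerequisites list.
MIN_PYTHON = (3, 7)


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


def missing_tools(tools, which=shutil.which):
    """Return the executables from `tools` that are not found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    print("Python >= 3.7:", python_ok())
    # Assumed executable names; adjust to match your installation.
    for tool in missing_tools(["conda", "mongod", "redis-server"]):
        print("not found on PATH:", tool)
```

A missing `mongod` or `redis-server` only matters if you plan to regenerate the KILT data or use QUIP.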
- Clone the repository:
git clone https://github.com/orionw/according-to.git
cd according-to
- Create and activate the conda environment:
conda env create -f env.yml
conda activate according-to
- Download the KILT/PubMedQA data:
git clone https://huggingface.co/datasets/orionweller/according-to-data
pip install git+https://github.com/facebookresearch/KILT.git
- For QUIP usage, follow the setup instructions in the Data Portraits repository.
Optional: Regenerate KILT data
- Install MongoDB
- Clone the KILT repository:
git clone https://github.com/facebookresearch/KILT.git
- Follow the KILT README to download and prepare the data
- Start the MongoDB server and load all documents
- Run our parser:
python src/parse_kilt_files.py
Optional: Regenerate PubMed data
- Clone the PubMedQA repository:
git clone https://github.com/pubmedqa/pubmedqa.git
- Follow their README to split the dataset
- Use the resulting pubmedqa.json file for further processing
QUIP Setup
- Ensure you have recent versions of cmake and gcc
- Clone the Data Portraits repository
- Install Redis:
bash install_redis.sh
- Install the package:
pip install -e .
- Start Redis:
python easy_redis.py --just-start
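To confirm that Redis is answering after the steps above, a small probe like the following can be used. This is a sketch under assumptions, not part of Data Portraits: `redis_ping` is a hypothetical helper, and port 6379 is only Redis's standard default, so the Data Portraits scripts may start the server on a different port. It relies on Redis accepting an inline `PING` command and replying `+PONG`.

```python
import socket


def redis_ping(host="localhost", port=6379, timeout=2.0):
    """Send an inline PING and return True if the server answers +PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")
            reply = conn.recv(64)
    except OSError:
        # Covers refused connections and timeouts alike.
        return False
    return reply.startswith(b"+PONG")


if __name__ == "__main__":
    print("Redis reachable:", redis_ping())
```

A server that requires authentication replies with an error instead of `+PONG`, so the probe reports False in that case as well.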
- Set your OpenAI API key (if using OpenAI models):
export OPENAI_API_KEY='your-api-key-here'
- Generate the configuration file:
python src/create_experiments_config.py -c configs/chatgpt_pubmed.jsonl -p prompts/prompts_pubmed.jsonl -o to_run.jsonl --debug
- Run the experiments:
./bin/run_batch.sh configs/to_run.jsonl
Results will be saved in the results directory with timestamps.
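Since the exact naming of the timestamped outputs is not described here, one way to find the most recent run is to sort by file modification time rather than parse names. The sketch below assumes only that outputs land somewhere under `results`; `newest_files` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def newest_files(results_dir="results", limit=5):
    """Return up to `limit` files under results_dir, newest first by mtime."""
    root = Path(results_dir)
    if not root.is_dir():
        return []
    files = [p for p in root.rglob("*") if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return files[:limit]


if __name__ == "__main__":
    for path in newest_files():
        print(path)
```

Run it from the repository root so that the relative `results` path resolves.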
Note: Check the configs/ folder for different configuration examples. Prompts used in the paper are in the prompts/ directory.
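The config and prompt files use the `.jsonl` extension, which conventionally means one JSON object per line. To look inside one before running experiments, a reader along these lines may be useful. It is a sketch only: `read_jsonl` is a hypothetical helper, and it makes no assumption about which fields the repository's files contain.

```python
import json


def read_jsonl(path):
    """Parse a JSON Lines file into a list of objects, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as err:
                # Report the offending line so a broken config is easy to locate.
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON ({err.msg})"
                ) from err
    return records
```

For example, `read_jsonl("configs/chatgpt_pubmed.jsonl")` called from the repository root should return one parsed object per non-blank line of that file.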
Access model generations at orionweller/according-to-generations, organized by model, dataset, and prompt.
This project is licensed under the MIT License - see the LICENSE file for details.
If you find our work useful in your research, please consider citing:
@inproceedings{weller-etal-2024-according,
title = "{``}According to . . . {''}: Prompting Language Models Improves Quoting from Pre-Training Data",
author = "Weller, Orion and
Marone, Marc and
Weir, Nathaniel and
Lawrie, Dawn and
Khashabi, Daniel and
Van Durme, Benjamin",
booktitle = "Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = mar,
year = "2024",
address = "St. Julian{'}s, Malta",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.eacl-long.140",
pages = "2288--2301",
}