Evaluation of Large Language Models via Coupled Token Generation

This repository contains the code and data for the paper "Evaluation of Large Language Models via Coupled Token Generation" by Nina Corvelo Benz, Stratis Tsirtsis, Eleni Straitouri, Ivi Chatzi, Ander Artola Velasco, Suhas Thejaswi, and Manuel Gomez-Rodriguez.

Dependencies

All experiments were run with Python 3.11. To create a virtual environment and install the project dependencies, run the following commands:

python3 -m venv env
source env/bin/activate
pip install -r requirements.txt

Code organization

The directory data contains the data used for the experiments.

The directory models contains the list of models used.

The directory src contains the source code for the experiments.

The directory scripts contains bash scripts that use code under src to run the experiments.

The directory notebooks contains Jupyter notebooks that produce the figures in the paper.

The directory figures is used for saving the figures produced by the notebooks.

The directory outputs is used for saving the outputs produced by the scripts.

Instructions

Downloading the models

Our experiments use LLMs from the Llama family. Llama models are "gated", that is, you must accept the license and be granted access before you can use them. You can request access at: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct. Once you have access, you can download any model in the Llama family. Then, before running the scripts, authenticate with your Hugging Face account by running huggingface-cli login in the terminal. Each model is downloaded to the models folder the first time a script uses it.

Setting up

Run python3 src/merge_tokenizers.py before running the scripts to set up the joint vocabulary.
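
This README does not describe how src/merge_tokenizers.py builds the joint vocabulary. Purely as an illustration, and assuming a joint vocabulary is the union of each model's token strings with one shared id per token, the idea could be sketched as follows. The function name, the toy vocabularies, and the model names are hypothetical and are not taken from the repository.

```python
def build_joint_vocabulary(vocabs: dict[str, dict[str, int]]):
    """Merge per-model vocabularies (token string -> model-specific id).

    Returns the joint vocabulary (token string -> joint id) and, for each
    model, a map from its model-specific ids to joint ids.
    This is an assumed scheme, not the repository's implementation.
    """
    # Sort the union of token strings so that joint ids are deterministic.
    all_tokens = sorted(set().union(*(v.keys() for v in vocabs.values())))
    joint = {token: idx for idx, token in enumerate(all_tokens)}
    to_joint = {
        name: {model_id: joint[token] for token, model_id in vocab.items()}
        for name, vocab in vocabs.items()
    }
    return joint, to_joint


# Hypothetical toy vocabularies, not the real Llama tokenizers.
toy_vocabs = {
    "model_a": {"<s>": 0, "hello": 1, "world": 2},
    "model_b": {"<s>": 0, "world": 1, "!": 2},
}
joint_vocab, id_maps = build_joint_vocabulary(toy_vocabs)
print(joint_vocab)
print(id_maps["model_b"])
```

Under this assumption, a token such as "world" gets a single joint id even though the two toy models index it differently, which is what would let randomness be shared token by token across models.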

MMLU experiment

The final output files of the experiment are provided in the outputs/mmlu directory. To reproduce the figures in the paper, you only need to run the mmlu.ipynb notebook.

The script mmlu.sh produces the outputs of one LLM, using one seed, given the questions from the MMLU dataset as input prompts. To reproduce all the outputs, run the script twice for each model, once for independent generation and once for coupled generation, using the seeds provided in the script.
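
To make the difference between independent and coupled generation concrete, here is a minimal sketch of a single sampling step. It assumes that coupling means the models share the same source of randomness, and it uses Gumbel-max sampling with shared noise as one way to achieve that. The function names and the toy next-token distributions are hypothetical, and the code in src may differ.

```python
import math
import random


def gumbel_noise(rng: random.Random, size: int) -> list[float]:
    """Draw `size` independent standard Gumbel variables."""
    # Clamp away from 0 so that both logarithms are finite.
    return [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in range(size)]


def sample_token(probs: list[float], noise: list[float]) -> int:
    """Gumbel-max sampling: argmax of log-probability plus noise."""
    scores = [
        math.log(p) + g if p > 0 else -math.inf for p, g in zip(probs, noise)
    ]
    return max(range(len(probs)), key=scores.__getitem__)


def next_tokens(probs_a, probs_b, seed: int, coupled: bool) -> tuple[int, int]:
    """Sample one token per model, with shared or separate noise."""
    noise_a = gumbel_noise(random.Random(seed), len(probs_a))
    if coupled:
        noise_b = noise_a  # both models see the same noise
    else:
        noise_b = gumbel_noise(random.Random(seed + 1), len(probs_b))
    return sample_token(probs_a, noise_a), sample_token(probs_b, noise_b)


# Hypothetical next-token distributions over a shared 4-token vocabulary.
probs_a = [0.5, 0.3, 0.15, 0.05]
probs_b = [0.4, 0.4, 0.15, 0.05]
for coupled in (False, True):
    pairs = [next_tokens(probs_a, probs_b, s, coupled) for s in range(1000)]
    agreement = sum(a == b for a, b in pairs) / len(pairs)
    print(f"coupled={coupled}: tokens agree in {agreement:.1%} of draws")
```

With shared noise, two models that assign identical probabilities always pick the same token, so any remaining difference in their outputs comes from the models rather than from sampling luck.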

LMSYS experiment

The final output files of the experiment are provided in the outputs/LMSYS directory. To reproduce the figures in the paper, you only need to run the lmsys.ipynb notebook.

The script lmsys.sh produces the outputs of one LLM, using one seed, in response to the questions in data/processed/LMSYS/questions.json. To reproduce all the outputs, run the script twice for each model, once for independent generation and once for coupled generation, using the seeds provided in the script. The results of the pairwise comparisons of these outputs by GPT-4o-2024-11-20 are provided in the outputs/LMSYS directory.
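
As a rough illustration of how pairwise comparison results can be summarized, the sketch below computes a win rate per model, counting a tie as half a win for each side. The record layout (keys model_a, model_b, winner) and the model names are assumptions made for this example. The format of the files in outputs/LMSYS is not described here, and lmsys.ipynb remains the reference for the analysis in the paper.

```python
from collections import defaultdict


def win_rates(comparisons: list[dict]) -> dict[str, float]:
    """Return each model's win rate, counting a tie as half a win."""
    wins = defaultdict(float)
    games = defaultdict(int)
    for c in comparisons:
        a, b = c["model_a"], c["model_b"]
        games[a] += 1
        games[b] += 1
        if c["winner"] == "tie":
            wins[a] += 0.5
            wins[b] += 0.5
        else:
            # c["winner"] is the key ("model_a" or "model_b") of the winner.
            wins[c[c["winner"]]] += 1.0
    return {model: wins[model] / games[model] for model in games}


# Hypothetical judge verdicts for two placeholder models X and Y.
toy_comparisons = [
    {"model_a": "X", "model_b": "Y", "winner": "model_a"},
    {"model_a": "X", "model_b": "Y", "winner": "tie"},
    {"model_a": "Y", "model_b": "X", "winner": "model_a"},
    {"model_a": "X", "model_b": "Y", "winner": "model_a"},
]
print(win_rates(toy_comparisons))
```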

Contact & attribution

If you have questions about the code, find potential bugs, or would like us to add functionality, feel free to open an issue or contact Ivi Chatzi or Eleni Straitouri.

If you use parts of the code in this repository for your own research, please consider citing:

@article{benz2025evaluation,
  title={Evaluation of Large Language Models via Coupled Token Generation}, 
  author={Nina Corvelo Benz and Stratis Tsirtsis and Eleni Straitouri and Ivi Chatzi and Ander Artola Velasco and Suhas Thejaswi and Manuel Gomez-Rodriguez},
  year={2025},
  journal={arXiv preprint arXiv:2502.01754}
}
