minOLMo - An Explainable Open Language Model

minOLMo is a fork of the original OLMo model, created with the primary goal of removing extra complexity and distributed training capabilities. The result is a large language model with a simplified codebase that is easy to understand and follow. It is designed to be run and explored by researchers on a single GPU, making it accessible to those who want to delve into the workings of large language models without extensive computational resources.

Fork History

  • Apr 5, 2024: David Brandfonbrener forked the original OLMo repository to create the min-olmo repository, removing the distributed training capabilities.
  • Apr 22, 2024: The Kempner Institute forked David's min-olmo repository to create the KempnerInstitute/min-olmo repository. Any code added after this date comes from Kempner Institute-affiliated contributors.

Package Structure

The project includes two main categories of files and directories:

  • minOLMo Python package: This package contains the source code for the model.
  • Scripts, configs, and other helper files to run the model.

In the following, we provide a brief description of the main directories and files in the project:

  • configs: This directory contains the configuration files for the model. Each configuration file specifies the model parameters, input data paths, and other settings needed to train the model.
  • docs: This directory contains the documentation for the project.
    • Standalone documentation can be added as a single Markdown file.
    • The PDF technical report is located in the docs/technical_report directory.
    • Documentation that may need extra files (e.g., images) can be added in a separate directory.
  • minolmo: This directory contains the source code for the model.
  • scripts: This directory contains the scripts to run the model.
  • notebooks: This directory contains the notebooks to explore the model.
  • tests: This directory contains the tests for the model.
  • CHANGELOG.md: This file contains the changes made to the project.
  • README.md: The current file.
  • LICENSE: The license file for the project.
  • pyproject.toml: The configuration file for building the Python package.
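
Putting these together, the top-level layout looks roughly like this (a sketch assembled from the list above; exact contents may differ):

  minOLMo/
  ├── configs/                # training configuration files
  ├── docs/
  │   └── technical_report/   # PDF technical report
  ├── minolmo/                # model source code
  ├── notebooks/              # notebooks to explore the model
  ├── scripts/                # scripts to run the model
  ├── tests/                  # tests for the model
  ├── CHANGELOG.md
  ├── LICENSE
  ├── pyproject.toml
  └── README.md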

Installation

To install the package:

  • Step 1: Clone the repository

    • For developers:

      git clone git@github.com:KempnerInstitute/minOLMo.git
    • For users:

      git clone https://github.com/KempnerInstitute/minOLMo.git
  • Step 2: Create a conda environment
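
    For example, assuming conda is available on the cluster and using Python 3.12 to match the module loaded in Step 3 (the environment name and version are placeholders; adjust them as needed):

      conda create -n minolmo python=3.12
      conda activate minolmo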

  • Step 3: Load modules

      module load python/3.12.5-fasrc01
      module load cuda/12.4.1-fasrc01
      module load cudnn/8.9.2.26_cuda12-fasrc01
  • Step 4: Install the package

      pip install -r requirements.txt
      pip install -e .
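
To check that the installation worked, you can try importing the package (this assumes it is importable as minolmo, matching the package directory name):

  python -c "import minolmo"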

Running the Model

To run the model, you can use the provided scripts in the scripts directory. Before running the model, you need to have the following:

  • Binary numpy files for the training data.
  • Binary numpy files for the validation data.
  • Your Weights & Biases (W&B) entity for logging the training process (if you want to use it).
  • A run name.
  • A save folder.
  • A configuration file that specifies:
    • The path to the input training data folder.
    • The path to the input validation data folder.
    • The W&B entity.
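
As a rough sketch, these pieces map onto the training command as follows (all values below are hypothetical placeholders; the data folders and the W&B entity are set inside the YAML configuration file rather than on the command line):

  # Hypothetical placeholder values -- edit the YAML config to point at your data:
  #   training data:   /path/to/train_data/   (binary numpy files)
  #   validation data: /path/to/val_data/     (binary numpy files)
  #   W&B entity:      my-wandb-entity
  python scripts/train.py configs/base-c4-t5.yaml --run_name=my_run --save_folder=/path/to/save_folder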

Interactive Session

After you have all the necessary files and configurations, you can run the model interactively.

First, allocate a compute node:

salloc -p kempner_h100 --account=[your account] --nodes=1 --ntasks=1 --cpus-per-task=24 --mem=375G --gres=gpu:1 -t 00-12:00:00

Then you can run the model:

python scripts/train.py configs/base-c4-t5.yaml  --run_name=olmo --save_folder=save_folder

Batch Job

To submit a batch job, you can use the run_single_gpu.sh script in the scripts directory. The script submits the training job to the SLURM scheduler; you can modify it based on your requirements.

sbatch scripts/run_single_gpu.sh
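
As a rough idea of what such a script contains, a minimal single-GPU SLURM script might look like the sketch below (the resource values mirror the interactive example above; the actual scripts/run_single_gpu.sh may differ):

  #!/bin/bash
  #SBATCH --partition=kempner_h100
  #SBATCH --account=<your account>
  #SBATCH --nodes=1
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=24
  #SBATCH --mem=375G
  #SBATCH --gres=gpu:1
  #SBATCH --time=00-12:00:00

  # Load the same modules used during installation
  module load python/3.12.5-fasrc01
  module load cuda/12.4.1-fasrc01
  module load cudnn/8.9.2.26_cuda12-fasrc01

  # Activate your Python environment here if needed, then launch training
  python scripts/train.py configs/base-c4-t5.yaml --run_name=olmo --save_folder=save_folder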