
LoRA on analog hardware

This repository provides code for reproducing the results of "Efficient Deployment of Transformer Models in Analog In-Memory Computing Hardware".


Table of Contents

  - Installation
  - Hardware-Aware Training on SQuAD
  - Hardware-Aware LoRA Training on SQuAD
  - Hardware-Aware LoRA Training on GLUE
  - Scaling
  - Better LoRA and AIHWKIT settings
  - Citation


Installation

  1. Set Up the Environment and Clone the Repository:

    conda create -n analog_lora python=3.9
    conda activate analog_lora
    git clone https://github.com/chenlicodebank/lora_on_analog_hardware.git
    cd lora_on_analog_hardware
    
  2. Install Dependencies:

    pip install -r requirements.txt
    
  3. Install AIHWKIT:

    pip install aihwkit

    If this standard installation does not work on your platform (for example, if you need GPU support), follow AIHWKIT's advanced installation guide.
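To verify that AIHWKIT is installed correctly, you can run a quick sanity check such as the following (illustrative only, not part of the repository's scripts):

    # Minimal sanity check that AIHWKIT is importable and functional.
    import torch
    from aihwkit.nn import AnalogLinear
    from aihwkit.simulator.configs import InferenceRPUConfig

    layer = AnalogLinear(4, 2, rpu_config=InferenceRPUConfig())
    print(layer(torch.rand(1, 4)))  # should print a 1x2 output tensor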


Hardware-Aware Training on SQuAD

This section provides a straightforward application of Hardware-Aware Training with MobileBERT on the SQuAD v1.1 dataset.

Example:

cd hardware_aware_training
export SQUAD_DIR=$(pwd)/data
python run_qa.py \
  --model_name_or_path csarron/mobilebert-uncased-squad-v1 --dataset_name squad \
  --do_train \
  --report_to wandb \
  --logging_steps 100 \
  --do_eval \
  --save_strategy epoch \
  --per_device_train_batch_size 32 \
  --per_device_eval_batch_size 128 \
  --weight_decay 0.0001 \
  --num_train_epochs 15 \
  --max_seq_length 320 \
  --evaluation_strategy epoch \
  --doc_stride 128 \
  --warmup_steps 0 \
  --output_dir ./squad_models_train/ \
  --pcm_model PCM_Gmax25 \
  --output_noise_level 0.04 \
  --analog_optimizer AnalogAdam \
  --analog_lr 0.00005 \
  --num_evaluation_drift_values 7 \
  --num_evaluation_repetition 10
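The last two flags control the inference evaluation: the model is evaluated at several post-programming drift times, and each evaluation is repeated to average over stochastic programming and read noise. A minimal sketch of such a loop with AIHWKIT's inference utilities is shown below; it assumes analog_model is a model converted to analog tiles and evaluate is a placeholder for your own metric function, and the exact logic inside run_qa.py may differ:

    # Illustrative drift-evaluation loop; not the repository's exact code.
    import numpy as np

    analog_model.eval()
    drift_times = np.logspace(0, 6, num=7)  # e.g. 1 s ... 1e6 s after programming (example values)

    for t_inference in drift_times:
        scores = []
        for _ in range(10):  # repetitions average over stochastic programming/read noise
            # Re-programs the analog weights (programming noise) and applies
            # conductance drift up to time t_inference.
            analog_model.drift_analog_weights(t_inference)
            scores.append(evaluate(analog_model))  # placeholder evaluation function
        print(f"t={t_inference:.0e}s  mean score={np.mean(scores):.4f}")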

Hardware-Aware LoRA Training on SQuAD

This section provides an application of Hardware-Aware LoRA Training with MobileBERT on the SQuAD v1.1 dataset. Traditional Hardware-Aware Training typically involves two steps: fine-tuning the model in full precision, and then using the fine-tuned model for hardware-aware training. In Hardware-Aware LoRA Training, we skip the full-precision fine-tuning step and start directly from the pretrained model; specifically, we use --model_name_or_path google/mobilebert-uncased instead of --model_name_or_path csarron/mobilebert-uncased-squad-v1. This approach retains the flexibility for post-deployment adaptation to new tasks (e.g., GLUE) and hardware configurations (e.g., ADC bit settings); see the paper for details.

Example:

cd lora_training
export SQUAD_DIR=$(pwd)/data
python run_qa.py \
   --model_name_or_path google/mobilebert-uncased --dataset_name squad \
   --do_train \
   --report_to wandb \
   --logging_steps 100 \
   --do_eval \
   --save_strategy epoch \
   --per_device_train_batch_size 32 \
   --per_device_eval_batch_size 128 \
   --weight_decay 0.0001 \
   --num_train_epochs 15 \
   --max_seq_length 320 \
   --evaluation_strategy epoch \
   --doc_stride 128 \
   --warmup_steps 0 \
   --output_dir ./squad_models_train/ \
   --pcm_model PCM_Gmax25 \
   --output_noise_level 0.04 \
   --analog_optimizer AnalogAdam \
   --analog_lr 0.0002 \
   --num_evaluation_drift_values 7 \
   --num_evaluation_repetition 10
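For intuition, the overall setup (a pretrained backbone mapped to simulated analog tiles plus trainable low-rank adapters) can be sketched as follows. This is an illustrative sketch using the peft library and AIHWKIT's convert_to_analog; the rank and target modules shown are hypothetical example values, and details such as the adapter configuration and whether the adapters themselves stay digital are described in the paper and implemented inside run_qa.py:

    # Conceptual sketch only; the repository's actual LoRA integration may differ.
    from transformers import AutoModelForQuestionAnswering
    from peft import LoraConfig, get_peft_model
    from aihwkit.nn.conversion import convert_to_analog
    from aihwkit.simulator.configs import InferenceRPUConfig

    model = AutoModelForQuestionAnswering.from_pretrained("google/mobilebert-uncased")

    # Attach low-rank adapters to the attention projections
    # (rank and target modules are hypothetical example values).
    lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["query", "value"])
    model = get_peft_model(model, lora_cfg)

    # Map linear layers onto simulated analog tiles for hardware-aware training.
    analog_model = convert_to_analog(model, InferenceRPUConfig())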

Hardware-Aware LoRA Training on GLUE

This section provides an application of Hardware-Aware LoRA Training with MobileBERT on GLUE. The example below uses CoLA; see the shell scripts in lora_training_glue for the other GLUE subtasks.

Example:

cd lora_training_glue
export TASK_NAME=cola
export EXP_INDEX=1
python run_glue.py \
  --model_name_or_path google/mobilebert-uncased \
  --task_name $TASK_NAME \
  --ignore_mismatched_sizes \
  --report_to wandb \
  --logging_steps 100 \
  --do_train \
  --do_eval \
  --max_seq_length 128 \
  --per_device_train_batch_size 32 \
  --learning_rate 2e-4 \
  --num_train_epochs 15 \
  --output_dir ./results/$TASK_NAME/$EXP_INDEX \
  --pcm_model PCM_Gmax25 \
  --output_noise_level 0.04 \
  --analog_optimizer AnalogAdam \
  --analog_lr 0.0002 \
  --num_evaluation_drift_values 7 \
  --num_evaluation_repetition 10

Scaling

The evaluated model is MobileBERT, as its parameter count (25.3M) fits on modern analog chips. The proposed method can be applied to other models by setting --model_name_or_path to the desired model in the scripts above. Results on BERT_BASE (110M) and BERT_LARGE (340M) can be found in the paper.


Better LoRA and AIHWKIT settings

We employ naive LoRA to keep the implementation simple and to establish baseline results. Leveraging more advanced LoRA variants and better LoRA hyperparameters has the potential to yield superior performance compared to the results presented in our paper.

Additionally, the final performance is influenced by the settings in AIHWKIT. Most tunable parameters are configurable through the gen_rpu_config function.
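For reference, an inference-oriented RPU configuration of the kind returned by gen_rpu_config typically looks like the sketch below. It is illustrative only, using AIHWKIT's InferenceRPUConfig with a PCM-like noise model; the repository's actual settings and default values may differ:

    # Illustrative sketch; the repository's gen_rpu_config may use other settings.
    from aihwkit.inference import PCMLikeNoiseModel, GlobalDriftCompensation
    from aihwkit.simulator.configs import InferenceRPUConfig

    def example_rpu_config(g_max=25.0, out_noise=0.04):
        rpu_config = InferenceRPUConfig()
        rpu_config.forward.out_noise = out_noise        # additive output noise
        rpu_config.forward.inp_res = 1 / (2**8 - 2)     # input (DAC) resolution, example value
        rpu_config.forward.out_res = 1 / (2**8 - 2)     # output (ADC) resolution, example value
        rpu_config.noise_model = PCMLikeNoiseModel(g_max=g_max)  # PCM programming noise and drift
        rpu_config.drift_compensation = GlobalDriftCompensation()
        return rpu_config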

The training hyperparameters, such as the learning rate and the number of epochs, are selected heuristically. Further tuning of these hyperparameters may improve performance.


Citation

If you use this repository in your research or project, please consider citing it using the following format:

@article{li2024efficient,
  title={Efficient Deployment of Transformer Models in Analog In-Memory Computing Hardware},
  author={Li, Chen and Lammie, Corey and Le Gallo, Manuel and Rajendran, Bipin},
  journal={arXiv preprint arXiv:2411.17367},
  year={2024}
}
