SMIT is a versatile tool designed to streamline the integration of the audio modality into your LLMs. Currently, SMIT exclusively supports audio as a new modality. However, our goal is to expand its capabilities to accommodate any new modality seamlessly. We welcome contributions from the open-source community to help us achieve this aim.
Welcome to SMIT! Follow these simple steps to get started:
Begin by cloning the SMIT repository to your local machine using Git:
git clone https://github.com/Thytu/SMIT/
cd SMIT
We highly recommend using a virtual environment to manage dependencies and prevent conflicts. Create and activate a virtual environment using your preferred tool (e.g., virtualenv, conda):
# Example using virtualenv
virtualenv venv
source venv/bin/activate
Once inside the project directory and your virtual environment is activated, install the required dependencies listed in requirements.txt using pip:
pip install -r requirements.txt
You can quickly run the default example provided in SMIT by executing the following command:
python src/main.py
This will train the amazing abacaj/phi-2-super model to do ASR using the librispeech_asr dataset and facebook/hubert-large-ls960-ft as the speech encoder, reproducing the Thytu/phi-2-audio-super model.
Important
It's essential to ensure a minimum of 30GB of available VRAM to execute this command successfully. For users with >=80GB of VRAM, it's recommended to deactivate quantization while decreasing the batch size to expedite the training process. You can achieve this by running:
python src/main.py ~model.decoder.quantization_config ++training.training_args.per_device_train_batch_size=1
To customize your own Language Model (LLM), create a configuration file. You can use the provided config file template as a starting point. Then, use Hydra syntax to provide your configuration file:
python src/main.py model=my_config
Hydra offers extensive options for parameter overriding, allowing you to tailor the model according to your specific requirements. Refer to Hydra documentation for more details on customization options.
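To make the two overrides from the command above more concrete, here is a minimal sketch of their effect on a nested config. It is a plain-Python illustration, not Hydra's implementation: the helper apply_override is hypothetical, it handles only the `~key` (delete) and `++key=value` (set, creating the key if needed) forms, and the values in base are placeholders rather than SMIT's real defaults.

```python
from copy import deepcopy


def apply_override(config: dict, override: str) -> dict:
    """Return a copy of `config` with one Hydra-style override applied.

    Simplified illustration supporting only two forms:
      ~a.b.c        delete the key a.b.c
      ++a.b.c=val   set a.b.c to val, creating missing parents
    """
    result = deepcopy(config)
    if override.startswith("~"):
        *parents, leaf = override[1:].split(".")
        node = result
        for key in parents:
            node = node[key]
        del node[leaf]
    elif override.startswith("++"):
        path, raw_value = override[2:].split("=", 1)
        *parents, leaf = path.split(".")
        node = result
        for key in parents:
            node = node.setdefault(key, {})
        # Only integers are parsed here; everything else stays a string.
        node[leaf] = int(raw_value) if raw_value.isdigit() else raw_value
    else:
        raise ValueError(f"unsupported override: {override}")
    return result


# Placeholder values: the real defaults live in the SMIT config files.
base = {
    "model": {"decoder": {"quantization_config": {"placeholder": True}}},
    "training": {"training_args": {"per_device_train_batch_size": 8}},
}

config = base
for override in [
    "~model.decoder.quantization_config",
    "++training.training_args.per_device_train_batch_size=1",
]:
    config = apply_override(config, override)

print(config)
# {'model': {'decoder': {}}, 'training': {'training_args': {'per_device_train_batch_size': 1}}}
```

The quantization settings are dropped entirely and the batch size is forced to 1, which is what the command for >=80GB of VRAM asks Hydra to do.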
Once your model is trained, you can effortlessly load it for inference:
model = SMIT.from_pretrained("path_to_your_safetensor")
For inference tasks, you can use the generate method:
model.generate("Tell me how to add a modality to my model")
To use the generate method with multiple modalities, follow this approach:
model.generate(
    prompt=[
        "Tell me how to add a modality to my model",
        "Transcribe this audio from speech to text {audio}",
    ],
    raw_speech=[None, your_audio],
)
Note
When providing multiple prompts, ensure that the length of raw_speech matches the length of prompt.
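If you want to catch a mismatch before calling generate, a small pre-flight check along these lines can help. This is a sketch under assumptions: check_inputs is a hypothetical helper that is not part of SMIT, and the rule that a prompt containing {audio} must come with a non-None audio entry (and a text-only prompt with None) is inferred from the example above rather than confirmed behavior.

```python
from typing import Any, Optional, Sequence

AUDIO_PLACEHOLDER = "{audio}"  # placeholder seen in the example prompt


def check_inputs(prompt: Sequence[str], raw_speech: Sequence[Optional[Any]]) -> None:
    """Raise ValueError if prompts and audio entries do not line up."""
    if len(prompt) != len(raw_speech):
        raise ValueError(
            f"got {len(prompt)} prompts but {len(raw_speech)} raw_speech entries"
        )
    for index, (text, audio) in enumerate(zip(prompt, raw_speech)):
        has_placeholder = AUDIO_PLACEHOLDER in text
        # Assumed pairing rule: placeholder <=> audio provided.
        if has_placeholder and audio is None:
            raise ValueError(f"prompt {index} expects audio but raw_speech is None")
        if not has_placeholder and audio is not None:
            raise ValueError(f"prompt {index} has audio but no {AUDIO_PLACEHOLDER} placeholder")


prompts = [
    "Tell me how to add a modality to my model",
    "Transcribe this audio from speech to text {audio}",
]
fake_audio = [0.0, 0.1, -0.1]  # stand-in for a real waveform

check_inputs(prompts, [None, fake_audio])  # passes silently
```

Calling check_inputs(prompts, [fake_audio]) would raise a ValueError because the two lists have different lengths.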
SMIT simplifies the process of adding audio capabilities to your LLM, following the principles outlined in the An Embarrassingly Simple Approach for LLM with Strong ASR Capacity paper. It links a speech encoder to a decoder through a trainable linear projector, which gives your LLM the audio modality. SMIT automates this integration by making it as easy as configuring a single file.
To use SMIT, simply define your desired configuration in the provided config file. SMIT then handles the rest, seamlessly incorporating the audio modality into your models.
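The role of the linear projector can be illustrated with a toy, dependency-free sketch. Everything here is an assumption made for illustration: the dimensions, the helper names make_projector and project, and the way projected audio frames are placed between text embeddings. It does not reflect SMIT's actual code, which works on real tensors.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]


def make_projector(in_dim: int, out_dim: int, seed: int = 0) -> Tuple[Matrix, List[float]]:
    """Create the weight and bias of a linear layer (the trainable part)."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def project(frames: Matrix, weight: Matrix, bias: List[float]) -> Matrix:
    """Map each encoder frame (in_dim values) to a decoder-sized vector (out_dim values)."""
    return [
        [sum(w * x for w, x in zip(row, frame)) + b for row, b in zip(weight, bias)]
        for frame in frames
    ]


ENCODER_DIM, DECODER_DIM = 4, 6  # toy sizes, far smaller than real models

weight, bias = make_projector(ENCODER_DIM, DECODER_DIM)

# Stand-ins for speech-encoder outputs (3 frames) and text token embeddings.
audio_frames = [[1.0] * ENCODER_DIM for _ in range(3)]
text_before = [[0.0] * DECODER_DIM for _ in range(2)]
text_after = [[0.0] * DECODER_DIM]

# After projection the audio frames have the decoder's embedding size,
# so they can sit in the same sequence as the text embeddings.
audio_embeds = project(audio_frames, weight, bias)
decoder_inputs = text_before + audio_embeds + text_after

print(len(decoder_inputs), len(decoder_inputs[0]))  # 6 6
```

The point of the sketch is only the shape change: the projector is what lets encoder outputs of one size be consumed by a decoder that expects another.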
There are multiple ways to contribute to this project, either regarding the UX (e.g. the docs, or even making the example faster) or regarding the core product itself (e.g. handling the vision modality). Any contributions you make are greatly appreciated. If you have a suggestion that would make this better, feel free to tell me :D You can also check the open issues for more things to improve.
Don't forget to give the project a star! 🌟 Thanks again!
This project draws significant inspiration from the An Embarrassingly Simple Approach for LLM with Strong ASR Capacity paper. I thank the authors for sharing their expertise. Huge thanks to the CoolKids for their help in debugging some pesky issues I ran into. And last but definitely not least, a massive thank you to Oursin: this project simply wouldn't exist without you!
Hey, I'm Valentin De Matos, passionate about AI and always working on some new side project.
You can reach me at vltn.dematos@gmail.com and if you want more information you can always