(wip) support ultravox audio input #12745

Draft
wants to merge 3 commits into
base: master

Conversation

ngxson
Collaborator

@ngxson ngxson commented Apr 3, 2025

Current status: inference runs, but the output is gibberish

Original model: https://huggingface.co/fixie-ai/ultravox-v0_5-llama-3_2-1b

Why am I doing this?

Because ultravox seems to be fairly low-hanging fruit:

  • It uses the whisper encoder, so a lot of the code in this PR is copied from whisper.cpp
  • It uses a 2-matrix MLP to project from audio embd to text embd --> vision models already do this
  • It uses a vanilla llama 3.2 1B model without any fine-tuning
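
To illustrate the projector mentioned in the second point, here is a minimal sketch of a 2-matrix MLP that maps audio embeddings into the text embedding space. It is not the code from this PR: the toy dimensions, the random weights, the GELU activation, and the names `project_audio_to_text`, `w1` and `w2` are all assumptions made for illustration, and the real model may use a different activation or extra steps.

```python
import math
import random


def matmul(x, w):
    """Multiply x (n x k) by w (k x m); both are lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def gelu(v):
    # tanh approximation of GELU; the activation choice is an assumption
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))


def project_audio_to_text(audio_embd, w1, w2):
    """Hypothetical projector: audio embd -> hidden (w1) -> text embd (w2)."""
    hidden = [[gelu(v) for v in row] for row in matmul(audio_embd, w1)]
    return matmul(hidden, w2)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
# Toy sizes for illustration only, not the real model dimensions
N_FRAMES, D_AUDIO, D_HIDDEN, D_TEXT = 4, 8, 16, 12

audio_embd = random_matrix(N_FRAMES, D_AUDIO, rng)  # stand-in for encoder output
w1 = random_matrix(D_AUDIO, D_HIDDEN, rng)
w2 = random_matrix(D_HIDDEN, D_TEXT, rng)

text_embd = project_audio_to_text(audio_embd, w1, w2)
print(len(text_embd), len(text_embd[0]))  # one row per audio frame, D_TEXT columns
```

Each projected row has the same width as a text token embedding, so it can be placed in the LLM input sequence next to regular token embeddings, which is the same idea the vision projectors rely on.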

Applications of this can be quite useful. Take the example of an app that summarizes a meeting based on audio:

  • The traditional audio processing pipeline is: audio --> text --> summary. Many acoustic features are lost in the audio --> text transcription step
  • With multimodal input, the pipeline becomes: audio --> summary, with a lot less latency, and all audio features are retained, including pauses, music, tone, pitch, etc.

@github-actions github-actions bot added examples python python script changes labels Apr 3, 2025
@johnbenac

Do you think that this will work with the CSM GGUF pull request that you already implemented?

#12648

That was missing the encoder. I'm starting to work on the encoder (even though I really have no idea what I'm doing), but was this pull request an attempt to get encoding suitable for the GGUF CSM model?
