(wip) support ultravox audio input #12745

ngxson · 2025-04-03T22:57:50Z

Current status: inference runs, but the output is gibberish

Original model: https://huggingface.co/fixie-ai/ultravox-v0_5-llama-3_2-1b

Why am I doing this?

Because ultravox seems to be fairly low-hanging fruit:

It uses the whisper encoder, so a whole lot of code in this PR is copied from whisper.cpp
It uses a two-matrix MLP to project from audio embeddings to text embeddings --> vision models already do this
It uses a vanilla llama 3.2 1B model without any fine-tuning
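
To make the projection step concrete, here is a minimal sketch of a two-matrix MLP that maps audio encoder outputs into the text embedding space. Everything in it is an assumption for illustration: the function names, the toy dimensions, the random weights, and the ReLU between the two matrices are placeholders. The actual ultravox projector may use a different activation and extra steps (normalization, frame stacking) that are not described here.

```python
import random

def matmul(x, w):
    """Multiply an (n x d_in) list-of-lists by a (d_in x d_out) list-of-lists."""
    return [
        [sum(row[k] * w[k][j] for k in range(len(w))) for j in range(len(w[0]))]
        for row in x
    ]

def relu(x):
    """Element-wise ReLU. Placeholder activation, not necessarily what ultravox uses."""
    return [[max(0.0, v) for v in row] for row in x]

def project_audio_to_text(audio_embd, w1, w2):
    """Hypothetical projector: audio embd -> hidden -> text embd, using two matrices."""
    hidden = relu(matmul(audio_embd, w1))
    return matmul(hidden, w2)

def random_matrix(rows, cols, rng):
    """Toy weights; a real model would load these from the converted GGUF file."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
# Toy sizes chosen for readability, not the real model dimensions.
n_frames, d_audio, d_hidden, d_text = 4, 8, 16, 12

audio_embd = random_matrix(n_frames, d_audio, rng)  # stand-in for whisper encoder output
w1 = random_matrix(d_audio, d_hidden, rng)
w2 = random_matrix(d_hidden, d_text, rng)

text_embd = project_audio_to_text(audio_embd, w1, w2)
print(len(text_embd), len(text_embd[0]))  # one text-space vector per audio frame
```

The output keeps one row per audio frame but changes the width to the text model's embedding size, which is why the projected frames can be placed in the same input sequence as ordinary token embeddings, the same way vision models do it for image patches.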

Applications of this can be quite useful. Take the example of an app that summarizes a meeting from its audio:

The traditional audio processing pipeline is: audio --> text --> summary. Many acoustic features are lost in the audio --> text transcription step
With multimodal input, the pipeline becomes: audio --> summary. This has much lower latency, and all audio features are retained, including pauses, music, tone, pitch, etc.

johnbenac · 2025-04-06T02:52:49Z

Do you think that this will work with the CSM GGUF pull request that you already implemented?

#12648

That was missing the encoder. I'm starting to work on the encoder (even though I really have no idea what I'm doing) but was this pull request an attempt to get encoding suitable for the GGUF CSM model?

Og4rek · 2025-04-22T09:14:04Z

Hi @ngxson

Could you share the current status of this PR?
Is work on Ultravox audio input still moving forward, or has it been paused because the integration proved too difficult?

Thanks in advance!

ngxson added 2 commits April 3, 2025 16:11

(wip) convert ultravox-enc to gguf

62695aa

output but wrong

d44c721

github-actions bot added the examples and python (python script changes) labels Apr 3, 2025

ngxson mentioned this pull request Apr 4, 2025

llama : add llama_batch_ext #11875

Open

add conv layer

49193e2

ngxson mentioned this pull request Apr 9, 2025

server: Bring back multimodal support #8010

Closed

18 tasks

Merge branch 'master' into xsn/ultravox

67150e0

ngxson mentioned this pull request May 18, 2025

mtmd : add ultravox audio input #13623

Merged

ngxson closed this May 18, 2025
