
tts : implement sesame CSM + Mimi decoder #12648

Open
wants to merge 34 commits into master

Conversation

@ngxson (Collaborator) commented Mar 29, 2025

Related to #12392

Tbh it is more complicated than expected.

This PR only contains the backbone + decoder.

How to try this?

By default, all GGUF files are downloaded from the ggml-org Hugging Face account.

# build (make sure to have LLAMA_CURL enabled)
cmake -B build -DLLAMA_CURL=ON
cmake --build build -j --target llama-tts-csm

# run it
./build/bin/llama-tts-csm -p "[0]Hi, my name is Xuan Son. I am software engineer at Hugging Face."

Alternatively, the GGUF files can be converted using convert_mimi_to_gguf.py and convert_csm_to_gguf.py under the example/tts directory. These scripts use transformers.AutoModel under the hood, so they also handle downloading the safetensors files automatically.

Note: it pronounces "Xuan" incorrectly, but the rest is OK

output.mp4

How does Sesame CSM work?

The model contains a backbone and a decoder, both based on the llama 3.x architecture (auto-regressive).

  1. The input text is first processed by the backbone; the output is (1) an RVQ semantic code and (2) the raw embedding from the last layer, after the norm.
  2. These 2 outputs from the backbone are then passed into the decoder as input. The decoder then generates the next 31 RVQ acoustic tokens.
  3. At this point, 32 RVQ codes have been generated; their embeddings are "squashed" back into one single vector, which is passed back to the backbone.
  4. Repeat from step 1 to generate the next codes (see the flowchart and the sketch below).
flowchart TD
    A[Input Text, vocab 128_256 tokens] -- prompt input --> B

    subgraph Backbone
        B[Backbone transformer]
        B --> C[Output logits, vocab 65632 tokens]
        B --> D[Output Raw embd, vector of 2048 elem]
    end

    D -- vector input --> Proj
    C -- sampling --> Stoken[RVQ semantic token]
    Stoken --> Fin
    Stoken --> H

    subgraph Decoder
        Proj[Projector, reduce size to 1024]
        Fin[Input vocab: 65632 tokens] -- vector dim 2048 --> Proj
        Proj --> F[Decoder transformer]
        F --> G[Output logits: vocab 2051 tokens]
    end

    G -- sampling --> HH[RVQ acoustic token]
    HH -- generate next token --> Fin
    HH -- repeated 31 times --> H[Collected 32 RVQ tokens & audio embeddings, matrix: 2048 x 32]

    H -- sum all vectors --> I[single vector of 2048]
    I -- generate next token --> B
    I -- is zero vec? --> K[Stop generation]

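To make the loop above concrete, here is a rough, self-contained sketch of the generation loop. It is illustrative only: the helper functions are placeholders standing in for the backbone/decoder graph evaluations and the sampler, not the actual code in this PR.

```cpp
#include <utility>
#include <vector>

// Placeholders for the real graph evaluations and sampling (hypothetical, for illustration only):
std::pair<std::vector<float>, std::vector<float>> backbone_forward(const std::vector<float> & embd_in); // logits (65632) + raw embd (2048)
std::vector<float> decoder_forward(const std::vector<float> & proj, int prev_code);                     // logits (2051)
std::vector<float> project (const std::vector<float> & embd);  // 2048 -> 1024
std::vector<float> embed   (int code);                         // code  -> 2048-dim embedding
int                sample  (const std::vector<float> & logits);

// One audio frame: 1 semantic code from the backbone + 31 acoustic codes from the decoder.
// The 32 embeddings are summed ("squashed") into a single vector fed back to the backbone.
std::vector<int> generate_frame(const std::vector<float> & backbone_in, std::vector<float> & next_backbone_in) {
    auto [logits, raw_embd] = backbone_forward(backbone_in);   // step 1

    std::vector<int>   codes = { sample(logits) };             // RVQ semantic code
    std::vector<float> sum   = embed(codes[0]);                // running sum of embeddings

    const std::vector<float> proj = project(raw_embd);         // step 2: raw embd goes to the decoder
    for (int i = 0; i < 31; ++i) {                             // 31 RVQ acoustic codes
        codes.push_back(sample(decoder_forward(proj, codes.back())));
        const std::vector<float> e = embed(codes.back());
        for (size_t j = 0; j < sum.size(); ++j) {
            sum[j] += e[j];
        }
    }

    next_backbone_in = sum;                                    // step 3: squashed vector for the next backbone step
    return codes;                                              // step 4: repeat; an all-zero vector means stop
}
```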

@github-actions bot added the examples and python (python script changes) labels Mar 29, 2025
@ngxson mentioned this pull request Mar 30, 2025
@ngxson changed the title from "tts : implement sesame backbone + decoder" to "tts : implement sesame CSM + Mimi decoder" Mar 30, 2025
@ngxson marked this pull request as ready for review March 30, 2025 12:30
@arch-btw (Contributor)

Really nice!

I'm having some issues with longer sentences. Or is that just a limitation of the model?
For example:

-p "[0]Hi! How are you? I hope you"

Works, but:

-p "[0]Hi! How are you? I hope you are doing well"

Will go into an infinite loop of token generation.

@ngxson (Collaborator, Author) commented Mar 30, 2025

I think my implementation still has some problems, but I'm not sure where. I never got the logits to 100% match what the safetensors model generates.

I will reach out to the Sesame team to confirm whether I'm doing this correctly.

@ggerganov (Member)

The problem is that the output vocab size is always smaller than the model's defined vocab size.

Could we not fix the vocab size when creating the models?

@ngxson (Collaborator, Author) commented Mar 31, 2025

Hmm, ok, I see what you mean. I was looking at llama_sampler_sample and thought that it has a fixed n_vocab taken from llama_context.

But it turns out I can just make my own llama_sampler_sample with a different n_vocab; I'll give it a try!
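For reference, a minimal sketch of what such a custom sampling helper could look like, using the existing llama_token_data_array / llama_sampler_apply API (the helper name is made up; this is not the code in this PR):

```cpp
#include "llama.h"

#include <vector>

// Sketch: sample from only the first n_vocab entries of the logits, instead of the
// full vocab size stored in the llama_context.
static llama_token csm_sample_token(struct llama_sampler * smpl, const float * logits, int n_vocab) {
    std::vector<llama_token_data> cur(n_vocab);
    for (int i = 0; i < n_vocab; ++i) {
        cur[i] = llama_token_data{ (llama_token) i, logits[i], 0.0f };
    }

    llama_token_data_array cur_p = {
        /*.data     =*/ cur.data(),
        /*.size     =*/ cur.size(),
        /*.selected =*/ -1,
        /*.sorted   =*/ false,
    };

    llama_sampler_apply(smpl, &cur_p);

    return cur_p.data[cur_p.selected].id;
}
```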

@ngxson (Collaborator, Author) commented Mar 31, 2025

I added top-k 50 and temp 0.9 sampling; these values are taken from the python code. It does work better, but it still struggles with long text in some cases.

I think it's because they also train the model with audio and text tokens interleaved, but I still haven't found the python code for that. I only found this on their website:

Training samples are structured as alternating interleaved patterns of text and audio, with speaker identity encoded directly in the text representation.
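For reference, such a chain can be built with the existing llama.cpp sampler API; a sketch using the values mentioned above (top-k 50, temp 0.9):

```cpp
#include "llama.h"

// Sketch: top-k 50 + temperature 0.9 + random (dist) sampling combined into one chain.
static struct llama_sampler * csm_make_sampler() {
    struct llama_sampler * chain = llama_sampler_chain_init(llama_sampler_chain_default_params());

    llama_sampler_chain_add(chain, llama_sampler_init_top_k(50));
    llama_sampler_chain_add(chain, llama_sampler_init_temp (0.9f));
    llama_sampler_chain_add(chain, llama_sampler_init_dist (LLAMA_DEFAULT_SEED));

    return chain;
}
```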

@ggerganov (Member)

Does the Python implementation also struggle with that? If not, then it might indicate a bug in the ggml implementation.

@ngxson (Collaborator, Author) commented Mar 31, 2025

With this text:

[0]How do we know when someone truly understands us? It is rarely just our words—it is in the subtleties of voice: the rising excitement, the thoughtful pause, the warm reassurance. Voice is our most intimate medium as humans, carrying layers of meaning through countless variations in tone, pitch, rhythm, and emotion. Today's digital voice assistants lack essential qualities to make them truly useful. Without unlocking the full power of voice, they cannot hope to effectively collaborate with us. A personal assistant who speaks only in a neutral tone has difficulty finding a permanent place in our daily lives after the initial novelty wears off.

The llama.cpp version finishes generation after about 800 codes, the result is:

output.3.mp4

On my local macbook, with mlx-audio, the output is cut off after ~10s

audio_000.mp4

On the HF space, the generation seems fine, though it gets cut off after 30s (I think it's limited so that the Zero GPU timeout is not reached).

audio.2.mp4

So I think both the llama.cpp and mlx-audio implementations are missing something. I don't yet have time to try the official python code (given that it's only runnable on an nvidia GPU). If someone has the time & the GPU, could you please try?

@ngxson (Collaborator, Author) commented Apr 1, 2025

Ok so after confirming with the sesame team, the problem was that I misidentified the bug. I thought that the summation in the "squash" step covers only the 31 acoustic embeddings, but it is actually the sum of all 32 embeddings. The reason the sum of 32 didn't work for me earlier was that I used greedy sampling.

Now, with both the sum of 32 and top-k/temp sampling implemented, it works like magic!
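In plain terms, the corrected "squash" is just a sum over all 32 code embeddings. A standalone illustration with plain float buffers (not the actual ggml graph code):

```cpp
#include <vector>

// Sum the embeddings of all 32 RVQ codes (1 semantic + 31 acoustic) into a single
// 2048-dim vector that is fed back into the backbone. The earlier, buggy version
// summed only the 31 acoustic embeddings.
std::vector<float> squash_embeddings(const std::vector<std::vector<float>> & code_embd /* 32 x 2048 */) {
    const size_t n_embd = code_embd.empty() ? 0 : code_embd[0].size();
    std::vector<float> out(n_embd, 0.0f);
    for (const auto & embd : code_embd) {   // all 32 codes, not just the acoustic ones
        for (size_t i = 0; i < n_embd; ++i) {
            out[i] += embd[i];
        }
    }
    return out;
}
```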

(Note: the silence added at the end was due to the conversion to mp4; the original file doesn't have that.)

output.4.mp4

The Sesame team also confirmed to me that the input text and audio are interleaved by turn: <text_utt1><audio_utt1><text_utt2><audio_utt2>...<text_uttN><audio_uttN>. This should be easy to implement; I will do that today.

@ggerganov One thing I'm also thinking about: the decoder model is very small, so I think it could be faster if we do a "batch generation", meaning the whole decoder cgraph is run 32 times without synchronization. This is indeed what they do in the python implementation. The key is to have a sampling function that can run inside the cgraph. Currently, the llama.cpp impl can do 300 t/s on my macbook, but I believe this "batch generation" could allow at least 600 t/s. Could be something fun to try after this PR is merged. WDYT?

@ggerganov (Member)

(Note: the silence added at the end was due to the conversion to mp4; the original file doesn't have that.)

Btw, you can convert very easily with ffmpeg and it won't have silence:

ffmpeg -i output.wav output.mp4

@ggerganov (Member)

@ggerganov One thing I'm also thinking about: the decoder model is very small, so I think it could be faster if we do a "batch generation", meaning the whole decoder cgraph is run 32 times without synchronization. This is indeed what they do in the python implementation. The key is to have a sampling function that can run inside the cgraph. Currently, the llama.cpp impl can do 300 t/s on my macbook, but I believe this "batch generation" could allow at least 600 t/s. Could be something fun to try after this PR is merged. WDYT?

Yes, GPU sampling should be supported eventually in libllama. The samplers would need to implement a call that appends nodes to an existing cgraph:

llama.cpp/include/llama.h, lines 1180 to 1183 (at 2bb3597):

// TODO: API for internal libllama usage for appending the sampling to an existing ggml_cgraph
//void (*apply_ggml) (struct llama_sampler * smpl, ...);
};

@ngxson (Collaborator, Author) commented Apr 2, 2025

Ok so I added support for multi-turn text input, but the generated audio has a silence gap between the 2 turns.

I observed kind of the same thing on the python demo, so I think it's something to do with the model.

@ngxson requested a review from ggerganov April 2, 2025 15:33
@ggerganov (Member)

but the generated audio has a silence gap between 2 turns.

I am doing some testing and I think what is confusing it is the newlines in the input. If I remove the newlines, it seems to work better:

csm-demo.txt

[0]Hey how are you doing.[1]Pretty good, pretty good.[0]I'm great, so happy to be speaking to you. What about you?[1]Me too, this is some cool stuff huh?

Maybe double-check that the tokenization is correct, compared to the HF space demo?

@ngxson (Collaborator, Author) commented Apr 3, 2025

I had a deeper look into the code of the HF demo space. It seems like, for each turn, they re-evaluate the whole "chat" history: https://huggingface.co/spaces/sesame/csm-1b/blob/main/app.py#L150-L156

But that does not change much. My understanding is that this is the same idea as text chat templates. The only difference is that in this case, with audio embeddings, our chat template looks like this:

<bos> ... text1 ... <text_eos> ... audio_embd ... <audio_eos><bos> ... text2 ... <text_eos> ... audio_embd ... <audio_eos> ...

So it seems like we're just missing <audio_eos>. I added it in my last commit, but it does not change much. The only difference so far is that it's now able to generate a male/female voice for each separate turn (which it was unable to do before).

What I'm speculating is that we're also missing the "system prompt" (same idea as the speaker reference in outeTTS), which shows the model how the voice should behave. In the official demo, they have a casual voice and a more serious news-reader voice. I'll give it a try.

@Desir-Armann

@ngxson would it be possible to have audio streaming?

@ngxson (Collaborator, Author) commented Apr 5, 2025

We don't support streaming, to keep things simple. It can be added in the future when the implementation becomes more stable.

@ShaanveerS

@ngxson Appreciate the thorough work here.
You mentioned that streaming could be added once things stabilize... would you be open to briefly describing what steps or components would be involved in supporting it?
Thanks a lot.

@ggerganov (Member)

@ngxson Should I review or wait for the:

What I'm speculating is that we're also missing the "system prompt" (same idea as the speaker reference in outeTTS), which shows the model how the voice should behave. In the official demo, they have a casual voice and a more serious news-reader voice. I'll give it a try.

@ngxson (Collaborator, Author) commented Apr 8, 2025

@ggerganov please go ahead and review this PR. The system prompt will be simple to add; I will try to do that a bit later (it requires me to use the Mimi encoder via transformers).

@ngxson (Collaborator, Author) commented Apr 9, 2025

@ggerganov I added the speaker reference and it works well. You were right about the newline stuff: the model is very sensitive to newline characters, and it usually adds a long pause in place of a newline.

output.mp4
