tts : implement sesame CSM + Mimi decoder #12648
base: master
Conversation
Really nice! I'm having some issues with longer sentences, or is that just the model's limitations?
Works, but:
It will go into an infinite loop of token generation.
I think my implementation still has some problems, but I'm not sure where. I never get the logits to 100% match what the safetensors model generates. I will reach out to the Sesame team to confirm whether I'm doing this correctly.
Could we not fix the vocab size when creating the models?
Hmm, ok, I see what you mean. I was looking at that, but it turns out I can just make my own.
I added top-k 50 and temp 0.9 sampling; these values are taken from the Python code. It does work better, but in some cases it still struggles with long text. I think that's because they also train the model with audio and text tokens interleaved, but I still haven't found the Python code for that. I only found this on their website:
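For reference, here is roughly what the top-k + temperature sampling looks like. This is a minimal NumPy sketch of the idea, not the actual ggml/C++ code in this PR; k=50 and temp=0.9 are the values mentioned above:

```python
import numpy as np

def sample_top_k(logits: np.ndarray, k: int = 50, temp: float = 0.9, rng=None) -> int:
    # pick one code id from the k most likely tokens, after temperature scaling
    rng = rng or np.random.default_rng()
    scaled = logits / temp                        # apply temperature
    top_idx = np.argpartition(scaled, -k)[-k:]    # indices of the k largest logits
    top_logits = scaled[top_idx]
    probs = np.exp(top_logits - top_logits.max())
    probs /= probs.sum()                          # softmax over the top-k only
    return int(rng.choice(top_idx, p=probs))
```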
Does the Python implementation also struggle with that? If not, then it might indicate a bug in the ggml implementation.
With this text:
The llama.cpp version finishes generation after about 800 codes; the result is: output.3.mp4
On my local MacBook, with: audio_000.mp4
On the HF space, the generation seems fine, though it gets cut off after 30s (I think it's limited so the Zero GPU timeout is not reached): audio.2.mp4
So I think both llama.cpp and ...
Ok, so after confirming with the Sesame team, the problem was that I had misidentified the bug. I thought that the summation in the "squash" step covers only the 31 acoustic embeddings, but it is actually the sum of all 32 embeddings. The reason the sum of 32 didn't work for me earlier was that I used greedy sampling. Now, with both the sum of 32 and top-k/temperature sampling implemented, it works like magic! (Note: the silence added at the end is due to the conversion to mp4; the original file doesn't have that.)

output.4.mp4

The Sesame team also confirmed to me that the input text and audio will be interleaved by turn:

@ggerganov One thing I'm also thinking about: the decoder model is very small, so I think it could be faster if we do a "batch generation", meaning the whole decoder cgraph can be run 32 times without synchronization. This is indeed what they do in the Python implementation. The key is to have a sampling function that can run on the cgraph. Currently, the llama.cpp implementation does 300 t/s on my MacBook, but I believe this "batch generation" could allow at least 600 t/s. Could be something fun to try after this PR is merged. WDYT?
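For reference, a tiny sketch of the corrected "squash" step as I understand it. This is illustrative NumPy only; the codebook size and embedding size here are assumptions, not values from the PR:

```python
import numpy as np

n_codebooks, vocab_size, n_embd = 32, 2048, 2048   # assumed sizes
embd_tables = np.random.randn(n_codebooks, vocab_size, n_embd).astype(np.float32)

def squash(frame_codes: np.ndarray) -> np.ndarray:
    """frame_codes: (32,) int code ids for one audio frame."""
    embds = embd_tables[np.arange(n_codebooks), frame_codes]  # (32, n_embd)
    return embds.sum(axis=0)  # sum over ALL 32 codebooks -> next backbone input
```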
Btw, you can convert very easily with `ffmpeg -i output.wav output.mp4`
Yes, GPU sampling should be supported eventually; see lines 1180 to 1183 in 2bb3597.
Ok, so I added support for multi-turn text input, but the generated audio has a silence gap between the two turns. I observed kind of the same thing on the Python demo, so I think it's something to do with the model.
I am doing some testing, and I think what is confusing it is the newlines in the input. If I remove the newlines, it seems to work better: csm-demo.txt
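For the test above I just collapsed the newlines into spaces before passing the text to the model, something like this (my own preprocessing, not part of the PR):

```python
import re

def flatten_text(text: str) -> str:
    # replace any run of whitespace containing a newline with a single space
    return re.sub(r"\s*\n+\s*", " ", text).strip()
```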
Maybe double-check that the tokenization is correct, compared to the HF space demo?
I had a deeper look into the code of the HF demo space. It seems that for each turn, they re-evaluate the whole "chat" history: https://huggingface.co/spaces/sesame/csm-1b/blob/main/app.py#L150-L156 But that does not change much though. What I understand is that this is the same thing as text chat templates; the only difference is that in this case, with audio embeddings, our chat template looks like this:
So it seems like we're just missing that. What I'm speculating is that we're also missing the "system prompt" (same idea as the speaker in OuteTTS), which shows the model how the voice should behave. In the official demo, they have a casual voice and a more serious news-reader voice. I'll give it a try.
@ngxson is it possible to have audio streaming?
We don't support streaming, to keep things simple. It can be added in the future when the implementation becomes more stable.
@ngxson Appreciate the thorough work here.
@ngxson Should I review or wait for the:
@ggerganov please go ahead and review this PR. The system prompt will be simple to add; I will try to do that a bit later (it requires me to use the Mimi encoder via transformers).
@ggerganov I added the speaker reference and it works well. You were right about the newline stuff: the model is very sensitive to newline characters; it usually adds a long pause in place of the newline.
output.mp4
Related to #12392
Tbh it is more complicated than expected.
This PR only contains the backbone + decoder:
How to try this?
By default, all GGUF files are downloaded from the ggml-org Hugging Face account.
Alternatively, GGUF files can be converted using `convert_mimi_to_gguf.py` and `convert_csm_to_gguf.py` under the `example/tts` directory. These scripts use `transformers.AutoModel` under the hood, so they will also handle downloading the safetensors files automatically.

Note: it pronounces "Xuan" incorrectly, but the rest is OK
output.mp4
How does Sesame CSM work?
The model contains a backbone and a decoder, both based on the llama 3.x architecture (auto-regressive).
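To make the flow concrete, here is a toy, runnable sketch of the two-stage generation loop as I understand it. The real backbone and decoder are llama-3.x-style transformers; they are replaced here with random projections just to show the data flow, and all sizes and names are my own assumptions, not the PR's code:

```python
import numpy as np

rng = np.random.default_rng(0)
n_codebooks = 32   # 1 semantic + 31 acoustic codebooks (Mimi)
vocab       = 2048 # assumed codebook size
n_embd      = 64   # toy embedding size

embd_tables = rng.standard_normal((n_codebooks, vocab, n_embd))
backbone_W  = rng.standard_normal((n_embd, vocab))               # stand-in for the backbone
decoder_W   = rng.standard_normal((n_codebooks, n_embd, vocab))  # stand-in per-codebook heads

def generate_frame(context: np.ndarray):
    """context: (n_ctx, n_embd) interleaved text/audio embeddings."""
    # 1. the backbone attends over the whole context and predicts codebook 0
    h = context.mean(axis=0)                         # toy "last hidden state"
    codes = [int(np.argmax(h @ backbone_W))]

    # 2. the small decoder auto-regressively predicts codebooks 1..31
    state = h + embd_tables[0, codes[0]]
    for cb in range(1, n_codebooks):
        codes.append(int(np.argmax(state @ decoder_W[cb])))
        state = state + embd_tables[cb, codes[-1]]

    # 3. "squash": the sum of all 32 codebook embeddings becomes the
    #    next backbone input position (see the discussion above)
    next_input = embd_tables[np.arange(n_codebooks), codes].sum(axis=0)
    return codes, next_input

codes, nxt = generate_frame(rng.standard_normal((10, n_embd)))
print(len(codes), nxt.shape)   # 32 codes per frame, (64,) next backbone input
```

The Mimi decoder then turns the stream of generated 32-codebook frames back into a waveform.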