added rudimentary support for outetts v0.3 500m and 1b models #11287
base: master
Conversation
Yeah, that’s why in the library I grouped them before and after words. It might not be the best solution, but it works:
Input:
Converts to:
Yeah, but even if the TTC part works, I think the CTS part might fail. I can definitely do that if you think it's better.
```diff
@@ -371,7 +371,7 @@ static std::string replace_numbers_with_words(const std::string & input_text) {
 }

 // Based on: https://github.com/edwko/OuteTTS/blob/a613e79c489d8256dd657ea9168d78de75895d82/outetts/version/v1/prompt_processor.py#L39
-static std::string process_text(const std::string & text) {
+static std::string process_text(const std::string & text, bool is_version_0_3) {
```
btw to check if the version is 0.3, you can use:
bool is_version_0_3 = common_get_builtin_chat_template(model) == "outetts-0.3"
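A minimal sketch of how that suggestion could be wrapped in the example (assuming common_get_builtin_chat_template() returns the built-in template name as a std::string; the helper name is_outetts_v0_3 is illustrative):

```cpp
#include <string>

#include "common.h"
#include "llama.h"

// Sketch only: wraps the check suggested above. Assumes
// common_get_builtin_chat_template() returns the model's built-in chat
// template name as a std::string; is_outetts_v0_3 is an illustrative name.
static bool is_outetts_v0_3(const llama_model * model) {
    return common_get_builtin_chat_template(model) == "outetts-0.3";
}
```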
@edwko I planned to add this as a dedicated GGUF meta key, but it turns out I still haven't had the time to implement it. I'll try to do it next week! And btw congrats on the release of v0.3 😄
@edwko how is this case currently handled for you:
I had issues when encountering fragments with only spaces and punctuation but no readable text. The narration breaks down once that is encountered.
@LostRuins All punctuation is merged into the closest word in cases like this
Speech generation works fine if you follow this format. I just tested both
I noticed you removed the in-between spaces. What are the rules for that? The naive approach would generate
It processes the text like this:
Here’s the implementation for this:
When constructing the words back to create the audio prompt, it joins the punctuation like this:
```python
word = s["word"]
if i.get("before", []):
    word = "".join(i["before"]) + word
if i.get("after", []):
    word += "".join(i["after"])
```
Yeah, anyway this is exactly what I meant by the various edge cases that may need to be untangled regarding punctuation, which is why I initially excluded it. Perhaps we can consider starting with this, and then expanding the implementation? Happy for someone to improve upon it here, either before or after merging.
recommended way to check if the version is 0.3, as requested by ngxson
@ngxson @ggerganov are there any problems with this?
I am looking towards adding more general-purpose TTS support, so I don't want to spend too much effort on this example. Its main purpose was to demonstrate a possible TTS implementation.
OuteTTS v0.3 is a must-have for this example
This is currently broken when running:
But it works fine with OuteAI/OuteTTS-0.2-500M-GGUF. Any thoughts?
@cjcox17 I think it's because v0.3 adds
The chat template logic changed quite a lot recently after the introduction of the jinja engine; I'll have a look later.
Hi @ggerganov @edwko
This PR adds rudimentary support for the newly released OuteTTS v0.3 500M and 1B models, found at https://huggingface.co/OuteAI/OuteTTS-0.3-500M-GGUF and https://huggingface.co/OuteAI/OuteTTS-0.3-1B-GGUF.
This will allow loading and generating with the new models, although crucially it ignores the new punctuation tokens. I had previously added them in my own fork, but they come with a lot of edge cases that may not be so easy to untangle, since they are grouped with other tokens and there are degenerate cases (e.g. www..!...google....com??) that will cause problems if they are simply swapped in as-is.

The model types are differentiated by attempting to tokenize <|space|>, which is a single token in v0.3 but not in earlier versions. For the 1B model, the token <|0|> has a different offset, so it is now determined dynamically. The existing speaker voice is retained, but I swapped out your hardcoded token array for a runtime tokenization for the same reasons (and also to adapt to the v0.3 format).

This remains compatible with v0.2 and should be able to load all 3 model types.
It is actually ready to merge as-is, but feel free to make whatever changes you deem necessary. Cheers!
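A minimal sketch of the <|space|> detection idea described above (assuming the common_tokenize() helper from llama.cpp's common library; the function name and exact flags are assumptions):

```cpp
#include <vector>

#include "common.h"
#include "llama.h"

// Sketch only: detect an OuteTTS v0.3 vocab by tokenizing "<|space|>",
// which is a single token in v0.3 but splits into several tokens in
// earlier versions. Assumes the common_tokenize() helper; flags may differ.
static bool outetts_vocab_is_v0_3(const llama_context * ctx) {
    std::vector<llama_token> toks = common_tokenize(ctx, "<|space|>",
                                                    /*add_special*/ false,
                                                    /*parse_special*/ true);
    return toks.size() == 1;
}
```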