Voice Experience Issue in Google Text to Speech (streaming_synthesize) #13405

Closed
AKSHILMY opened this issue Jan 7, 2025 · 3 comments
Labels
needs more info (This issue needs more information from the customer to proceed.)
type: question (Request for information or clarification. Not an issue.)

@AKSHILMY

AKSHILMY commented Jan 7, 2025

Using a Journey voice in normal (non-streaming) TTS gives an engaging, expressive voice that draws the user in.
But when I use the same Journey voice with streaming text input and streaming audio output, the audio sounds flatter and less engaging; it just reads the text out.

Why is that?
I can't find any way to control it.

Reference TTS Example

parthea added the "type: question" label Jan 20, 2025
@parthea
Contributor

parthea commented Jan 21, 2025

Hi @AKSHILMY,

I'm going to transfer this issue to the python-docs-samples repository, which is the source of truth for the code sample Reference TTS Example.

While running the code sample, I noticed that the response in audio_content when using streaming_synthesize contains headerless data.

audio_content (bytes):
The audio data bytes encoded as specified in
the request. This is headerless LINEAR16 audio
with a sample rate of 24000.

Can you confirm that the necessary header was created to play the audio file? If so, can you share the specific code used to create the audio header?

I created a code sample which contains the raw WAV file header (following the spec at https://docs.fileformat.com/audio/wav/) to help with debugging.

import google.cloud.texttospeech_v1 as texttospeech_v1
import itertools

client = texttospeech_v1.TextToSpeechClient()

# See https://cloud.google.com/text-to-speech/docs/voices for all voices.
streaming_config = texttospeech_v1.StreamingSynthesizeConfig(voice=texttospeech_v1.VoiceSelectionParams(name="en-US-Journey-F", language_code="en-US"))

# Set the config for your stream. The first request must contain your config, and then each subsequent request must contain text.
config_request = texttospeech_v1.StreamingSynthesizeRequest(streaming_config=streaming_config)

# Request generator. Consider using Gemini or another LLM with output streaming as a generator.
def request_generator():
    yield texttospeech_v1.StreamingSynthesizeRequest(input=texttospeech_v1.StreamingSynthesisInput(text="Movies, oh my gosh, I just just absolutely love them."))
    yield texttospeech_v1.StreamingSynthesizeRequest(input=texttospeech_v1.StreamingSynthesisInput(text="They're like time machines taking you to different worlds and landscapes,"))
    yield texttospeech_v1.StreamingSynthesizeRequest(input=texttospeech_v1.StreamingSynthesisInput(text="and um, "))
    yield texttospeech_v1.StreamingSynthesizeRequest(input=texttospeech_v1.StreamingSynthesisInput(text="and I just can't get enough of it."))

streaming_responses = client.streaming_synthesize(itertools.chain([config_request], request_generator()))

# This is a raw header based on the spec at https://docs.fileformat.com/audio/wav/
header = b'RIFF\x00\x00\x00\x00WAVEfmt \x10\x00\x00\x00\x01\x00\x01\x00\xc0]\x00\x00\x80\xbb\x00\x00\x02\x00\x10\x00data\x00\x00\x00\x00'

total_length = 0

with open("output.wav", "wb") as out:
    out.write(header)
    for response in streaming_responses:
        # calculate the length of the content
        total_length += len(response.audio_content)
        out.write(response.audio_content)
    # Position 40 - 43: Size of the data section (32-bit little-endian integer)
    out.seek(40)
    out.write(total_length.to_bytes(4, "little"))

import os
file_size = os.path.getsize("output.wav")

with open("output.wav", "r+b") as out:
    # Position 4-7: Size of the overall file minus 8 bytes (32-bit little-endian integer). Typically, you'd fill this in after creation.
    out.seek(4)
    out.write((file_size - 8).to_bytes(4, "little"))
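
As an alternative to patching the header bytes by hand, the standard-library wave module can write the header and fill in both size fields when the file is closed. The following is a minimal sketch, not part of the original sample: write_wav is a hypothetical helper name, the sample rate and 16-bit width come from the audio_content description quoted above, and mono output is an assumption taken from the raw header in the sample.

```python
import wave
from typing import Iterable

SAMPLE_RATE_HZ = 24000   # from the audio_content description (headerless LINEAR16 at 24000)
SAMPLE_WIDTH_BYTES = 2   # LINEAR16 means 16-bit samples
NUM_CHANNELS = 1         # assumption: mono, matching the raw header in the sample


def write_wav(path: str, chunks: Iterable[bytes]) -> int:
    """Write headerless LINEAR16 chunks to a WAV file (hypothetical helper).

    Returns the number of audio bytes written. The wave module fills in
    the RIFF and data size fields when the file is closed.
    """
    total = 0
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(NUM_CHANNELS)
        wav_file.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav_file.setframerate(SAMPLE_RATE_HZ)
        for chunk in chunks:
            wav_file.writeframes(chunk)
            total += len(chunk)
    return total
```

With the sample above, this could be called as write_wav("output.wav", (r.audio_content for r in streaming_responses)), assuming each response carries raw LINEAR16 bytes as the docstring describes.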

parthea commented Jan 21, 2025

Googlers see b/391302662

parthea commented Jan 27, 2025

I wasn't able to re-create the issue reported in #13405 (comment). However, follow-up issues (GoogleCloudPlatform/python-docs-samples#13080 and internal issue b/391302662) were created to improve the samples documentation, specifically the guidance around creating the necessary header for the WAV file.

I'm going to close this issue as not reproducible but please feel free to open a new issue if the problem is still present.
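
For readers following the header guidance mentioned above, here is a minimal sketch of how the 44-byte PCM WAV header from the earlier sample could be built field by field with struct, instead of as a raw byte literal. build_wav_header is a hypothetical helper, and the defaults (24000 Hz, mono, 16-bit) are assumptions taken from the raw header and the audio_content description in this thread.

```python
import struct


def build_wav_header(data_size: int, sample_rate: int = 24000,
                     num_channels: int = 1, bits_per_sample: int = 16) -> bytes:
    """Build a 44-byte PCM WAV header (hypothetical helper, little-endian)."""
    block_align = num_channels * bits_per_sample // 8  # bytes per sample frame
    byte_rate = sample_rate * block_align              # bytes per second
    return struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", 36 + data_size, b"WAVE",  # RIFF size = file size minus 8
        b"fmt ", 16,                       # fmt chunk id and its length
        1,                                 # audio format 1 = PCM
        num_channels, sample_rate, byte_rate, block_align, bits_per_sample,
        b"data", data_size,                # size of the audio payload in bytes
    )
```

When the total length is known before writing, for example after collecting all response.audio_content chunks in memory, the header can be written once with the real data size and no seek is needed afterwards.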

parthea closed this as completed Jan 27, 2025