Voice Experience Issue in Google Text to Speech (streaming_synthesize) #13405

AKSHILMY · 2025-01-07T12:57:32Z

Journey voices in normal (non-streaming) TTS produce an engaging voice that draws the listener in.
But when I use the same Journey voice with streaming text input and streaming audio output, the audio I get is a less engaging voice that just reads the text out.

Why is that?
I can't find any way to control it.

Reference TTS Example

parthea · 2025-01-21T01:25:07Z

Hi @AKSHILMY,

I'm going to transfer this issue to the python-docs-samples repository, which is the source of truth for the code sample Reference TTS Example.

While running the code sample, I noticed that the response in audio_content when using streaming_synthesize contains headerless data.

google-cloud-python/packages/google-cloud-texttospeech/google/cloud/texttospeech_v1/types/cloud_tts.py

Lines 789 to 792 in 8bdc5d2

    
                   audio_content (bytes): 
        
                       The audio data bytes encoded as specified in 
        
                       the request. This is headerless LINEAR16 audio 
        
                       with a sample rate of 24000.

Please can you confirm that the necessary header was created to play the audio file? Please can you share the specific code used to create the audio header?
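For reference, one way to add the missing header is Python's standard-library `wave` module, which writes the RIFF/WAV header around raw PCM. The following is a minimal sketch, not the sample's actual code. It assumes mono, 16-bit audio at 24000 Hz (the sample rate comes from the docstring quoted above; mono is an assumption), and it uses hypothetical silent chunks in place of real `response.audio_content` values so that it runs without credentials.

```python
import io
import wave

SAMPLE_RATE_HZ = 24000   # from the audio_content docstring quoted above
SAMPLE_WIDTH_BYTES = 2   # LINEAR16 means 16-bit signed PCM
CHANNELS = 1             # assumption: mono output


def pcm_to_wav_bytes(pcm_chunks):
    """Wrap headerless LINEAR16 chunks in a WAV container and return the bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(CHANNELS)
        wav_file.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav_file.setframerate(SAMPLE_RATE_HZ)
        for chunk in pcm_chunks:
            # The wave module keeps the size fields in the header up to date.
            wav_file.writeframes(chunk)
    return buffer.getvalue()


# Hypothetical stand-in for the streamed audio_content values: two chunks of silence.
fake_chunks = [b"\x00\x00" * 1200, b"\x00\x00" * 1200]
wav_bytes = pcm_to_wav_bytes(fake_chunks)
```

With a real stream, the list of fake chunks would be replaced by a generator over `response.audio_content` for each streaming response.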

I created a code sample which contains the raw WAV file header (following the spec at https://docs.fileformat.com/audio/wav/) to help with debugging.

import google.cloud.texttospeech_v1 as texttospeech_v1
import itertools

client = texttospeech_v1.TextToSpeechClient()

# See https://cloud.google.com/text-to-speech/docs/voices for all voices.
streaming_config = texttospeech_v1.StreamingSynthesizeConfig(voice=texttospeech_v1.VoiceSelectionParams(name="en-US-Journey-F", language_code="en-US"))

# Set the config for your stream. The first request must contain your config, and then each subsequent request must contain text.
config_request = texttospeech_v1.StreamingSynthesizeRequest(streaming_config=streaming_config)

# Request generator. Consider using Gemini or another LLM with output streaming as a generator.
def request_generator():
    yield texttospeech_v1.StreamingSynthesizeRequest(input=texttospeech_v1.StreamingSynthesisInput(text="Movies, oh my gosh, I just just absolutely love them."))
    yield texttospeech_v1.StreamingSynthesizeRequest(input=texttospeech_v1.StreamingSynthesisInput(text="They're like time machines taking you to different worlds and landscapes,"))
    yield texttospeech_v1.StreamingSynthesizeRequest(input=texttospeech_v1.StreamingSynthesisInput(text="and um, "))
    yield texttospeech_v1.StreamingSynthesizeRequest(input=texttospeech_v1.StreamingSynthesisInput(text="and I just can't get enough of it."))

streaming_responses = client.streaming_synthesize(itertools.chain([config_request], request_generator()))

# This is a raw header based on the spec at https://docs.fileformat.com/audio/wav/
header = b'RIFF\x00\x00\x00\x00WAVEfmt \x10\x00\x00\x00\x01\x00\x01\x00\xc0]\x00\x00\x80\xbb\x00\x00\x02\x00\x10\x00data\x00\x00\x00\x00'

total_length = 0

with open("output.wav", "wb") as out:
    out.write(header)
    for response in streaming_responses:
        # calculate the length of the content
        total_length += len(response.audio_content)
        out.write(response.audio_content)
    # Position 40 - 43: Size of the data section (32-bit little-endian integer)
    out.seek(40)
    out.write(total_length.to_bytes(4, "little"))

import os
file_size = os.path.getsize("output.wav")

with open("output.wav", "r+b") as out:
    # Position 4-7: Size of the overall file minus 8 bytes (32-bit little-endian integer). Typically, you’d fill this in after creation.
    out.seek(4)
    out.write((file_size - 8).to_bytes(4, "little"))
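The raw header literal in the sample above can also be derived field by field, which makes it easier to check each value against the WAV spec. The sketch below is an illustration only, assuming the canonical 44-byte PCM header layout. The function name `build_wav_header` and its parameters are hypothetical and are not part of the client library.

```python
import struct


def build_wav_header(data_size, sample_rate=24000, channels=1, bits_per_sample=16):
    """Build a 44-byte PCM WAV header for `data_size` bytes of audio data."""
    block_align = channels * bits_per_sample // 8   # bytes per sample frame
    byte_rate = sample_rate * block_align           # bytes per second
    return struct.pack(
        "<4sI4s4sIHHIIHH4sI",        # "<" = little-endian, no padding
        b"RIFF", 36 + data_size,     # RIFF chunk size = file size - 8
        b"WAVE",
        b"fmt ", 16,                 # fmt chunk is 16 bytes for PCM
        1,                           # audio format 1 = PCM
        channels, sample_rate, byte_rate, block_align, bits_per_sample,
        b"data", data_size,
    )


# The placeholder header from the sample above, with both size fields zeroed.
placeholder = b'RIFF\x00\x00\x00\x00WAVEfmt \x10\x00\x00\x00\x01\x00\x01\x00\xc0]\x00\x00\x80\xbb\x00\x00\x02\x00\x10\x00data\x00\x00\x00\x00'

header = build_wav_header(data_size=0)
# Apart from the two size fields, the packed header matches the literal.
fields_match = header[8:40] == placeholder[8:40]
```

If the total audio length is known before writing, calling `build_wav_header(total_length)` up front avoids seeking back into the file to patch the size fields.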

parthea · 2025-01-21T11:36:19Z

Googlers see b/391302662

parthea · 2025-01-27T16:38:37Z

I wasn't able to re-create the issue reported in #13405 (comment). However, follow-up issues (GoogleCloudPlatform/python-docs-samples#13080 and internal issue b/391302662) were created to improve the sample documentation, specifically the guidance on creating the necessary header for the WAV file.

I'm going to close this issue as not reproducible, but please feel free to open a new issue if the problem is still present.

parthea added the type: question Request for information or clarification. Not an issue. label Jan 20, 2025

parthea mentioned this issue Jan 21, 2025

Provide guidance on creating a header for streaming_synthesize in streaming_tts_quickstart.py GoogleCloudPlatform/python-docs-samples#13080

Open

parthea added the needs more info This issue needs more information from the customer to proceed. label Jan 21, 2025

parthea self-assigned this Jan 21, 2025

parthea closed this as completed Jan 27, 2025
