
Added support for gpt4o-realtime models for Speech-to-Speech interactions #659


Draft
wants to merge 3 commits into base: main

Conversation

sharananurag998

This PR adds a real-time voice pipeline for OpenAI’s gpt-4o-realtime-preview model, enabling low-latency speech-to-speech interactions in the Speect framework. It brings a streaming audio interface, integrated tool execution, and handling for the main Realtime API events, while remaining fully compatible with the existing STT/TTS pipeline.


Key Features & Changes

  • RealtimeVoicePipeline:

    • New pipeline for direct, continuous audio-to-audio conversations with OpenAI’s real-time models.
    • Handles streaming microphone input and speaker output at 24kHz, as required by the API.
    • Supports push-to-talk and half-duplex operation to prevent echo/feedback.
  • Integrated Tool Calls:

    • Tools are registered with the pipeline and executed automatically when the model requests a function call.
    • Tool results are sent back to the model using the correct OpenAI Realtime API protocol.
  • Event Handling & Debugging:

    • Full support for all major OpenAI Realtime API events, including:
      • Audio and text deltas
      • Tool call arguments (streamed and completed)
      • Transcription events (conversation.item.input_audio_transcription.delta and .completed)
      • Session and rate limit updates
    • The example logs every transcription event, which makes it easy to debug what the model “hears.”
  • Echo & Feedback Mitigation:

    • Implements a buffer window after assistant audio playback to prevent microphone echo from triggering new turns.
    • Optionally enables server-side noise/echo reduction via input_audio_noise_reduction in the session config.
  • Sample Rate Fixes:

    • Ensures both input and output audio are always 24kHz PCM, as required by the OpenAI API (fixes “slow motion” audio bug).
  • Backwards Compatibility:

    • All changes are fully compatible with the existing STT/TTS pipeline and configuration.
    • Legacy examples and workflows continue to work without modification.
  • Documentation & Examples:

    • Updated docs/voice/pipeline.md with new real-time usage, configuration, and troubleshooting sections.
    • New example: continuous_realtime_assistant.py demonstrates push-to-talk, tool calls, and event handling.
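
For reviewers unfamiliar with the tool-call round trip described under "Integrated Tool Calls", here is a minimal sketch of how a function result can be returned to the model. It is not the code in this PR: `ToolRegistry` and `build_tool_result_events` are hypothetical names, and the event shapes (`conversation.item.create` with a `function_call_output` item, followed by `response.create`) reflect the OpenAI Realtime API as commonly documented, so check them against the current API reference.

```python
import json
from typing import Any, Callable, Dict, List

# Hypothetical registry type: tool name -> plain Python callable.
ToolRegistry = Dict[str, Callable[..., Any]]


def build_tool_result_events(
    call_id: str, name: str, arguments_json: str, tools: ToolRegistry
) -> List[dict]:
    """Run the requested tool and build the client events that return its result.

    Assumes the caller has already received the completed arguments for the
    function call (for example from a `response.function_call_arguments.done`
    server event) and extracted `call_id`, `name`, and the JSON argument string.
    """
    try:
        kwargs = json.loads(arguments_json or "{}")
        result = tools[name](**kwargs)
        output = json.dumps(result)
    except Exception as exc:
        # Report failures to the model instead of crashing the audio loop.
        output = json.dumps({"error": f"{type(exc).__name__}: {exc}"})

    return [
        {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": call_id,
                "output": output,  # the output is sent as a string
            },
        },
        # Ask the model to continue the turn now that the result is available.
        {"type": "response.create"},
    ]
```

Each returned dict would be serialized with `json.dumps` and sent over the realtime connection in order. Returning the events rather than sending them keeps the sketch independent of any particular WebSocket client.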

🛠️ How to Use

  • Realtime Pipeline:
    See the new example (continuous_realtime_assistant.py) and the updated docs/voice/pipeline.md for how to use RealtimeVoicePipeline with your OpenAI API key and tools.
  • Classic Pipeline:
    No changes required—existing STT/TTS flows are unaffected.
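
To make the 24 kHz requirement and the optional noise reduction setting concrete, the following sketch builds a session configuration event and encodes microphone samples for upload. It does not use `RealtimeVoicePipeline` itself, whose constructor is not shown in this description. The function names are hypothetical, and the field names (`input_audio_format`, `output_audio_format`, `input_audio_noise_reduction`, `input_audio_buffer.append`) are assumptions based on the Realtime API at the time of this PR; verify them before relying on them.

```python
import base64
import struct
from typing import Iterable

SAMPLE_RATE_HZ = 24_000  # rate this PR states for both input and output audio


def build_session_update(enable_noise_reduction: bool = True) -> dict:
    """Build a `session.update` event requesting PCM16 audio in both directions."""
    session = {
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
    }
    if enable_noise_reduction:
        # Assumed value; "near_field" targets close microphones such as headsets.
        session["input_audio_noise_reduction"] = {"type": "near_field"}
    return {"type": "session.update", "session": session}


def encode_audio_chunk(samples: Iterable[float]) -> dict:
    """Convert float samples in [-1.0, 1.0] into a base64 PCM16 append event."""
    ints = [max(-32768, min(32767, int(round(s * 32767)))) for s in samples]
    pcm = struct.pack(f"<{len(ints)}h", *ints)  # little-endian signed 16-bit
    return {
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm).decode("ascii"),
    }


def chunk_duration_ms(num_samples: int, sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Playback duration of a chunk at the given sample rate."""
    return 1000.0 * num_samples / sample_rate_hz
```

The duration helper illustrates the "slow motion" symptom: 480 samples last 20 ms at 24 kHz, but a device opened at 16 kHz would stretch the same samples to 30 ms, so speech plays back at two thirds of its intended speed.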

…ions

- Added detailed documentation for the new `RealtimeVoicePipeline`, including usage examples and event handling for real-time audio interaction.
- Introduced a new example script demonstrating the `RealtimeVoicePipeline` with continuous audio streaming and tool execution.
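
The half-duplex behaviour and the post-playback buffer window mentioned under "Echo & Feedback Mitigation" can be pictured with a small gate object. This is an illustrative sketch only: `EchoGate`, its method names, and the 0.5 second default are hypothetical and are not taken from the PR's code.

```python
import time
from typing import Callable


class EchoGate:
    """Decides whether microphone audio should be forwarded to the model.

    The mic is closed while assistant audio is playing and for a short
    hold-off window afterwards, so speaker echo does not start a new turn.
    """

    def __init__(
        self,
        holdoff_s: float = 0.5,  # assumed window length; tune per device
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._holdoff_s = holdoff_s
        self._clock = clock
        self._playing = False
        self._muted_until = 0.0

    def playback_started(self) -> None:
        self._playing = True

    def playback_finished(self) -> None:
        self._playing = False
        self._muted_until = self._clock() + self._holdoff_s

    def mic_open(self) -> bool:
        """True when captured audio may be sent upstream."""
        return not self._playing and self._clock() >= self._muted_until
```

A capture loop would call `mic_open()` before forwarding each chunk and drop the chunk when it returns False. Injecting the clock makes the window easy to unit test without sleeping.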
@sharananurag998 force-pushed the main branch 3 times, most recently from 8bcb389 to b8899f7 (May 7, 2025 11:06)
@sharananurag998 marked this pull request as draft (May 7, 2025 14:29)