In most popular instant messaging apps, voice notes are second-class citizens. koe gives them equal footing by integrating them more seamlessly into the message flow.
A voice note by itself communicates no context beyond its length. That interrupts the flow of communication: you are forced to switch between reading and listening, with no way to stick to just one or the other. koe gives voice notes room to breathe and lets users read them as well as listen.
Furthermore, sentiment analysis is used to colorize the message bubbles, communicating the emotion of the message. Much more thought needs to go into how to better visualize the essence, emotion, and context of a voice note, but this is a proof-of-concept start.
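The bubble colorization could be as simple as mapping the top-scoring go_emotions label to a tint. A minimal sketch, where the label names come from SamLowe/roberta-base-go_emotions but the specific colors and the fallback behavior are illustrative assumptions, not what koe actually ships:

```javascript
// Hypothetical mapping from go_emotions labels to bubble tints.
// The colors and the neutral fallback are illustrative choices.
const EMOTION_TINTS = {
  joy: "#ffd166",
  anger: "#ef476f",
  sadness: "#6c8ebf",
  fear: "#9b5de5",
  surprise: "#06d6a0",
  neutral: "#e0e0e0",
};

// Pick the tint for the highest-scoring label; fall back to neutral
// for labels without a dedicated color (go_emotions has 28 of them).
function bubbleTint(scores) {
  const top = Object.entries(scores).sort((a, b) => b[1] - a[1])[0];
  if (!top) return EMOTION_TINTS.neutral;
  return EMOTION_TINTS[top[0]] ?? EMOTION_TINTS.neutral;
}
```

Collapsing 28 fine-grained labels down to a handful of tints keeps the palette readable; a fuller design might blend the top two or three labels instead.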
- SvelteKit
- Matrix protocol - matrix-js-sdk
- OpenAI gpt-4-1106-preview API (summarization)
- OpenAI's Whisper API (speech-to-text)
- SamLowe/roberta-base-go_emotions model (sentiment analysis)
- MediaStream Recording API (voice recording)
- wavesurfer.xyz (audio visualization library)
- PostgreSQL (db)
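For the recording piece of the stack, a browser-only sketch using the MediaStream Recording API; the mime type, the 60-second cap, and the duration formatting are illustrative assumptions, not koe's actual recorder:

```javascript
// Browser-only sketch: capture a voice note with MediaRecorder.
// The "audio/webm" mime type and the 60 s safety cap are assumptions.
async function recordVoiceNote() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream, { mimeType: "audio/webm" });
  const chunks = [];
  recorder.ondataavailable = (e) => chunks.push(e.data);

  return new Promise((resolve) => {
    recorder.onstop = () => {
      stream.getTracks().forEach((t) => t.stop());
      resolve(new Blob(chunks, { type: recorder.mimeType }));
    };
    recorder.start();
    // The UI would normally call recorder.stop(); cap at 60 s regardless.
    setTimeout(() => recorder.state === "recording" && recorder.stop(), 60_000);
  });
}

// Pure helper: format a note's length (ms) as m:ss for the bubble label.
function formatDuration(ms) {
  const totalSeconds = Math.round(ms / 1000);
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  return `${minutes}:${String(seconds).padStart(2, "0")}`;
}
```

The resulting Blob can be uploaded to the Matrix homeserver as an `m.audio` event and handed to wavesurfer for visualization.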
- Voice notes are summarized: speech-to-text first, then the transcript is run through gpt-4-turbo, and the summary is shown right below the voice note within the same text bubble;
- To show the full transcript, click on the summarized text, which will enlarge the text bubble;
- Improved playback of voice messages: instead of blindly sliding the duration dial on the note, the slider automatically snaps to the beginnings of sentences;
- Open message threads in a separate tab/window, similar to Discord;
- Down the road it would be cool to also add a text-to-speech option, so that all text messages get converted to speech, with the generated voice being that of the message sender;
- This can already be done with a very small amount of sample data (e.g. ElevenLabs), but the privacy concerns are too involved to pursue the idea so quickly. It would be cool to figure out a way that, as long as all parties in the chat consent to it, their voices are used for training and then for their text messages.
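The sentence-snapping seek from the list above can be sketched as a pure function. It assumes the transcript is available as Whisper-style segments with `start` times in seconds (as in the `verbose_json` response format); the function name and data shape are illustrative:

```javascript
// Snap a requested seek position to the start of the sentence (segment)
// at or before it. If the seek lands before the first segment, snap
// forward to the first sentence start instead.
function snapSeek(seconds, segments) {
  const starts = segments.map((s) => s.start).sort((a, b) => a - b);
  let snapped = starts.length ? starts[0] : 0;
  for (const t of starts) {
    if (t <= seconds) snapped = t;
    else break;
  }
  return snapped;
}
```

Wired to the wavesurfer slider, this means dragging anywhere inside a sentence replays it from its beginning, so the listener never lands mid-word.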
I'm using the Matrix protocol because it is an off-the-shelf, reliable, fairly mature E2EE messaging protocol that I can build a frontend around, instead of rolling my own.
This is a PoC I'm working on to satisfy my curiosity about whether voice notes can be better integrated into the messenger app communications flow, and is not intended to be a final product of any kind.
- Set up a basic local Matrix environment (using the Synapse homeserver)
- Integrate the Matrix client SDK w/ user authentication and session persistence
- Create a minimal chat interface
- Implement voice recording functionality
- Integrate wavesurfer waveform visualization for voice recordings
- Integrate speech-to-text API for transcription
- Implement sentiment analysis
- Develop color-coded emotion visualization
- Automatically cut out long pauses and filler between words (uhms, ahms, thinking breaks); make it optional for the user in the settings
- Create interactive playback dashboard
- Implement threaded conversations feature
- Prototype testing
- Prepare a demo setup
- Let people shit on the idea
- Try to refine the prototype based on feedback
- Slowly lose hope
- Hope lost, cue despair
- Move on to the next idea
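The pause-trimming item in the roadmap could start as a pure pass over the transcript timestamps. A sketch assuming Whisper-style segments with `start`/`end` seconds; the threshold and data shape are assumptions:

```javascript
// Compute the time ranges worth keeping, merging segments separated
// by short gaps and dropping silent gaps longer than maxPauseSec.
// The 1.0 s default threshold is an illustrative guess, and should be
// the user-tunable setting mentioned in the roadmap.
function keepRanges(segments, maxPauseSec = 1.0) {
  const ranges = [];
  for (const seg of segments) {
    const last = ranges[ranges.length - 1];
    if (last && seg.start - last.end <= maxPauseSec) {
      // Short gap: extend the previous range over it.
      last.end = Math.max(last.end, seg.end);
    } else {
      ranges.push({ start: seg.start, end: seg.end });
    }
  }
  return ranges;
}
```

The player (or an offline re-encode step) can then skip everything outside the returned ranges, which also shortens the snap points computed for sentence-based seeking.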