by koljab 402 days ago
It's in fact using Silero via RealtimeSTT. RealtimeSTT reports when silence starts. A binary sentence classification model then runs on the realtime transcription text; inference is fast (about 10 ms) and returns a probability between 0 and 1 indicating whether the sentence being spoken is considered "complete". The turn detection component uses this probability to calculate how long to wait in silence before deciding that the "turn is over".
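
To make the last step concrete, here is a rough sketch of how a completion probability could be turned into a silence waiting time. This is not the project's actual code: the function names, the linear mapping, and the 0.3 s / 2.0 s bounds are all assumptions picked for illustration.

```python
# Hypothetical sketch: map a "sentence complete" probability to a silence
# waiting time. Names, bounds, and the linear mapping are assumptions,
# not taken from RealtimeSTT or the project discussed here.

def silence_wait_seconds(p_complete: float,
                         min_wait: float = 0.3,
                         max_wait: float = 2.0) -> float:
    """Return how long to wait in silence before ending the turn.

    A high completion probability gives a short wait; a low one gives a
    long wait, so a speaker pausing mid-sentence is not cut off.
    """
    p = min(1.0, max(0.0, p_complete))  # clamp to [0, 1]
    return min_wait + (1.0 - p) * (max_wait - min_wait)


def turn_is_over(p_complete: float, silence_elapsed: float) -> bool:
    """True once the observed silence reaches the computed waiting time."""
    return silence_elapsed >= silence_wait_seconds(p_complete)
```

With these assumed bounds, a sentence scored as clearly complete ends the turn after a short pause, while one scored as likely unfinished needs a much longer silence before the agent starts talking.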
1 comment

This is the exact strategy I'm using for the real-time voice agent I'm building. LiveKit also published a custom turn detection model that, judging by the video they released, works really well. That was cool to see.

Code: https://github.com/livekit/agents/tree/main/livekit-plugins/... Blog: https://blog.livekit.io/using-a-transformer-to-improve-end-o...