

by pzo 471 days ago

I will have a look at this. I've played with pipecat before and it's great, but I switched to sherpa-onnx since I need something that compiles to native code and can run on edge devices.

I'm not sure turn detection can really be solved except with a dedicated push-to-talk button, like on a walkie-talkie. I've often tried the Google Translate app, and the problem is that when you speak a longer sentence you will often stop or slow down a little to gather your thoughts before continuing (especially if you are not a native speaker). For this reason I avoid conversation mode in apps like Google Translate, and in the Perplexity app I prefer the push-to-talk button mode over the new one.
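
To make the pause problem concrete, here is a minimal sketch of a silence-timeout end-of-turn detector. It assumes some VAD has already labeled each 20 ms frame as speech or not; the function name, frame size, and timeouts are all made up for illustration and are not taken from any of the apps mentioned.

```python
def detect_end_of_turn(frames, frame_ms=20, silence_timeout_ms=500):
    """Return the index of the frame where the turn is declared over, or None.

    frames: iterable of booleans, True when the frame contains speech
    (assumed to come from a separate VAD, which is not shown here).
    """
    silence_ms = 0
    heard_speech = False
    for i, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silence_ms = 0  # any speech resets the silence counter
        elif heard_speech:
            silence_ms += frame_ms
            if silence_ms >= silence_timeout_ms:
                return i
    return None


# 1 s of speech, an 800 ms thinking pause, 1 s of speech, then 1.5 s of silence.
utterance = [True] * 50 + [False] * 40 + [True] * 50 + [False] * 75

early = detect_end_of_turn(utterance, silence_timeout_ms=500)
late = detect_end_of_turn(utterance, silence_timeout_ms=1000)
```

With the 500 ms timeout the detector fires inside the thinking pause and cuts the speaker off mid-sentence. With 1000 ms it waits for the real end, but every reply is then delayed by a full second. A fixed timeout cannot give you both.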

I think this could be solved, but we would need not only low-latency turn detection but also low-latency speech interruption detection and a very fast, low-latency on-device LLM. And when there is an interruption, we need good recovery, so that the system knows we are continuing the last sentence instead of discarding the previous audio and starting a new one.
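
A rough sketch of what that recovery could look like: the turn is only closed provisionally, and if the user starts speaking again within a short window, the new text is appended to the old turn instead of replacing it. `TurnBuffer`, `resume_window_s`, and the 2 second default are hypothetical names and values chosen for the example, not from any real library.

```python
class TurnBuffer:
    """Keeps the last turn around so a quick resume continues it."""

    def __init__(self, resume_window_s=2.0):
        self.resume_window_s = resume_window_s
        self.segments = []
        self.ended_at = None  # time the turn was provisionally closed

    def add_speech(self, text, start_s):
        """Add a transcribed segment; return True if it continued the old turn."""
        continued = False
        if self.ended_at is not None:
            if start_s - self.ended_at <= self.resume_window_s:
                continued = True  # user resumed: keep the earlier segments
            else:
                self.segments = []  # long gap: treat as a new turn
            self.ended_at = None
        self.segments.append(text)
        return continued

    def end_turn(self, now_s):
        """Provisionally close the turn and return its text so far."""
        self.ended_at = now_s
        return self.text()

    def text(self):
        return " ".join(self.segments)


buf = TurnBuffer(resume_window_s=2.0)
buf.add_speech("I would like to book", start_s=0.0)
first_guess = buf.end_turn(now_s=1.5)
resumed = buf.add_speech("a table for two", start_s=2.5)
full_turn = buf.text()
```

In a real pipeline the caller would also have to cancel whatever LLM or TTS response was started from `first_guess` when `resumed` comes back true. That part is left out here.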

Lots of things can also be improved regarding I/O latency: using a low-latency audio API, a very short audio buffer, a dedicated audio category and mode (on iOS), wired headsets instead of the built-in speaker, and turning off system processing such as the iPhone's audio boosting or polar pattern. And streaming mode for everything: STT, transport (if using a remote LLM), TTS. Not sure if we can have TTS in streaming mode. I think most of the time they split by sentence.
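
For the split-by-sentence approach, a minimal sketch of how it could work with a streaming LLM: buffer the incoming tokens and hand each sentence to TTS as soon as it is complete, rather than waiting for the whole reply. The function name and the regex are my own assumptions. The splitter is naive and will break on abbreviations like "e.g. ".

```python
import re

# Sentence-ending punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"([.!?])\s+")


def sentences_from_tokens(tokens):
    """Yield complete sentences from an iterable of streamed text tokens."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = _SENTENCE_END.search(buffer)
            if match is None:
                break
            # Emit everything up to and including the punctuation mark.
            yield buffer[:match.end(1)].strip()
            buffer = buffer[match.end():]
    tail = buffer.strip()
    if tail:
        yield tail  # flush whatever is left when the stream ends


tokens = ["Hello", " there", ".", " How", " are", " you", "?", " Fine"]
sentences = list(sentences_from_tokens(tokens))
```

A sentence is only emitted once the next token after it arrives, so the first audio can start while the LLM is still generating the rest.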

I think push-to-talk is a good solution if well designed: a big button placed where it is easily reached with your thumb, integration with the iPhone Action button, haptics for feedback, using an Apple Watch as a big push button, etc.

1 comment

genewitch 470 days ago

Whisper can chunk or split on word boundaries. The speaker diarization stuff (I can't remember the name offhand) can also split on word boundaries, since it needs to identify the speaker for each word.
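
As a sketch of what splitting on word boundaries means in practice: given word-level timestamps, you can cut the audio into chunks that never end in the middle of a word. The `(text, start_s, end_s)` tuple format and the `chunk_words` helper are assumptions for illustration, not the output format of Whisper or any diarization tool.

```python
def chunk_words(words, max_chunk_s=10.0):
    """Group time-ordered (text, start_s, end_s) words into chunks.

    A new chunk is started whenever adding the next word would make the
    current chunk longer than max_chunk_s, so cuts fall between words.
    """
    chunks = []
    current = []
    for word in words:
        _, _, end = word
        if current and end - current[0][1] > max_chunk_s:
            chunks.append(current)
            current = []
        current.append(word)
    if current:
        chunks.append(current)
    # Collapse each chunk to (joined text, chunk start, chunk end).
    return [(" ".join(w[0] for w in c), c[0][1], c[-1][2]) for c in chunks]


words = [
    ("turn", 0.0, 0.4), ("detection", 0.5, 1.1), ("is", 1.2, 1.3),
    ("hard", 1.4, 1.9), ("to", 2.6, 2.7), ("solve", 2.8, 3.3),
]
chunks = chunk_words(words, max_chunk_s=2.0)
```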
