| HN Mirror

	by Spiwux 915 days ago
	I wonder if we're at a point where you could build a voice assistant like that, except almost-realtime and streamed end to end: the user speaks, and speech-to-text starts streaming text while the user is still speaking. That text stream is piped into an LLM, which also streams its output text. That output text is streamed to text-to-speech, which also generates audio in a streaming manner.
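
	A minimal sketch of that shape, using chained Python generators so each stage starts work as soon as the previous one yields. The three stage functions (`fake_stt`, `fake_llm`, `fake_tts`) are hypothetical stand-ins, not real models or library APIs, and a real LLM would not reply word by word to partial input; this only illustrates the plumbing.

```python
from typing import Iterable, Iterator


def fake_stt(audio_chunks: Iterable[bytes]) -> Iterator[str]:
    """Stand-in for streaming speech-to-text: emits one word per audio chunk."""
    for chunk in audio_chunks:
        yield chunk.decode("utf-8")


def fake_llm(words: Iterable[str]) -> Iterator[str]:
    """Stand-in for a streaming LLM: emits an upper-cased token per input word."""
    for word in words:
        yield word.upper()


def fake_tts(tokens: Iterable[str]) -> Iterator[bytes]:
    """Stand-in for streaming text-to-speech: emits one fake audio frame per token."""
    for token in tokens:
        yield f"<audio:{token}>".encode("utf-8")


def voice_pipeline(audio_chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Chain the stages lazily; nothing is buffered between them."""
    return fake_tts(fake_llm(fake_stt(audio_chunks)))
```

	Because every stage is a generator, pulling the first audio frame out of `voice_pipeline` consumes only the first input chunk, which is the "streamed end to end" property described above.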

6 comments

modeless 915 days ago

I implemented this! All local models. And I packaged it up so people can install it with one click: https://apps.microsoft.com/detail/9NC624PBFGB7

The speech recognition part needs work for sure, but when it works you can see the potential. It's very different from the way it feels to talk to Siri or even ChatGPT's voice mode. It won't be long before we are having real conversations with our computers.

bjelkeman-again 915 days ago

Could you record a demo of this?

modeless 915 days ago

I really should! I'm not the type to publish videos of myself usually, but it really does need a video demo.

3abiton 915 days ago

But how realtime is it?

modeless 915 days ago

The end-to-end response latency is around 1 second typically. It listens continuously, there are no buttons to press, and you can interrupt it while it's talking.
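
For readers wondering what "you can interrupt it while it's talking" might look like in code, here is a rough sketch using `asyncio` task cancellation. This is not the implementation from the app; `speak` and `converse` are hypothetical names, the sleeps stand in for audio playback, and the timer stands in for a voice activity detector noticing the user talking.

```python
import asyncio


async def speak(chunks, played):
    """Play reply chunks one at a time; the sleep stands in for audio playback."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(0.05)


async def converse(chunks, interrupt_after):
    """Start speaking, then cancel playback when the (simulated) user barges in."""
    played = []
    speaking = asyncio.create_task(speak(chunks, played))
    # Stand-in for a voice activity detector firing after `interrupt_after` seconds.
    await asyncio.sleep(interrupt_after)
    speaking.cancel()
    try:
        await speaking
    except asyncio.CancelledError:
        pass  # Playback was cut short, which is the intended outcome here.
    return played
```

Calling `asyncio.run(converse(chunks, interrupt_after=0.12))` with a long list of chunks returns only the chunks that were played before the interruption.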

evilantnie 915 days ago

TTS and STT models have decent support for streaming in chunks, but accuracy drops as the chunk size gets smaller. Current LLMs are pretty limited in their ability to handle streaming inputs due to attention window constraints. There is some emerging research into attention sinks and caching initial tokens that looks promising. I don't think we're quite there yet though.
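
To make the "caching initial tokens" idea concrete, here is a toy sketch of the eviction policy as I understand it: when the cache outgrows its budget, keep the first few entries (the attention sinks) plus a sliding window of the most recent ones, and drop the middle. The function name `evict` and its parameters are made up for illustration, and a real system would apply this to key/value tensors rather than a Python list.

```python
def evict(cache: list, num_sink: int, window: int) -> list:
    """Keep the first `num_sink` entries and the last `window` entries."""
    if len(cache) <= num_sink + window:
        return list(cache)  # Still within budget, nothing to drop.
    # Slice from an explicit index so that window=0 keeps no recent entries.
    return cache[:num_sink] + cache[len(cache) - window:]
```

For example, `evict(list(range(10)), num_sink=2, window=3)` returns `[0, 1, 7, 8, 9]`.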

everforward 915 days ago

You can do the "almost-realtime" part, all locally. I tinkered with a Python script for a few hours that used Whisper for speech-to-text, fed that into a local Mistral model (don't recall which), and then piped the output into text-to-speech.

It wasn't really streamed, though. Audio input was buffered, fully evaluated to a string, then fed into the LLM and the full text was converted back to audio.
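
One common way to get from that buffered setup toward streaming is to cut the LLM's token stream into sentences and hand each one to text-to-speech as soon as it is complete. The sketch below is not the script described here; `sentences` is a hypothetical helper, and the punctuation check is deliberately naive (abbreviations like "e.g." would split early).

```python
from typing import Iterable, Iterator


def sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed tokens into sentences, yielding each as soon as it ends."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Naive boundary check: any trailing '.', '!' or '?' ends a sentence.
        if buffer.rstrip().endswith((".", "!", "?")):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # Flush whatever is left when the stream ends.
```

For example, the tokens `["Hi", " there", ".", " How", " are", " you", "?"]` yield `"Hi there."` first, so audio for it could start before the second sentence has been generated.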

The Whisper speech-to-text was pretty real-time, the LLM was not. I was barely scraping by on hardware specs, though.

canadiantim 915 days ago

Did you try using ESP box?

zaptrem 915 days ago

Available as a phone line API (https://www.vocode.dev) and an open-source project (https://github.com/vocodedev/vocode-python).

adroitboss 915 days ago

This has happened already. It was maybe about 7 months ago, and I believe it was a Twitter link posted here. They took it further and streamed it to Twilio to create a live phone call.

fudged71 915 days ago

The one I tried was called Vocode

WiSaGaN 915 days ago

Is there any existing streaming API for LLM input?
