Hacker News new | ask | show | jobs
by everforward 917 days ago
You can do the "almost-realtime" part, all locally. I tinkered with a Python script for a few hours that used Whisper to speech-to-text, fed that into a local Mistral model (don't recall which), and then piped the output into text-to-speech.

It wasn't really streamed, though. Audio input was buffered, fully evaluated to a string, then fed into the LLM and the full text was converted back to audio.

The Whisper speech-to-text was pretty real-time, the LLM was not. I was barely scraping by on hardware specs, though.

1 comments

you try using ESP box?