
by koljab 450 days ago

I built RealtimeVoiceChat because I was frustrated with the latency in most voice AI interactions. This is an open-source (MIT license) system designed for real-time, local voice conversations with LLMs.

Quick Demo Video (50s): https://www.youtube.com/watch?v=HM_IQuuuPX8

The goal is to get closer to natural conversation speed. It uses audio chunk streaming over WebSockets, RealtimeSTT (based on Whisper), and RealtimeTTS (supporting engines like Coqui XTTSv2/Kokoro) to achieve around 500ms response latency, even when running larger local models like a 24B Mistral fine-tune via Ollama.
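To illustrate the chunk-streaming idea, here is a minimal sketch that slices a PCM buffer into small fixed-duration chunks, each of which could then be sent as one binary WebSocket message. The sample rate, sample width, chunk duration and the function name are assumptions for illustration, not values taken from the repo.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000   # assumed sample rate
BYTES_PER_SAMPLE = 2      # assumed 16-bit mono PCM
CHUNK_MS = 20             # assumed chunk duration


def iter_audio_chunks(pcm: bytes, chunk_ms: int = CHUNK_MS) -> Iterator[bytes]:
    """Yield fixed-duration slices of a PCM buffer; the last one may be shorter."""
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


# One second of silence, split into 20 ms chunks.
one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
chunks = list(iter_audio_chunks(one_second))
print(len(chunks), "chunks of", len(chunks[0]), "bytes")
```

Smaller chunks mean the receiving side can start working on audio sooner, at the cost of more messages per second.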

Key aspects: Designed for local LLMs (Ollama primarily, OpenAI connector included). Interruptible conversation. Smart turn detection to avoid cutting the user off mid-thought. Dockerized setup available for easier dependency management.

It requires a decent CUDA-enabled GPU for good performance due to the STT/TTS models.

Would love to hear your feedback on the approach, performance, potential optimizations, or any features you think are essential for a good local voice AI experience.

The code is here: https://github.com/KoljaB/RealtimeVoiceChat

10 comments

zaggynl 450 days ago

Neat! I'm already using openwebui/ollama with a 7900 xtx but the STT and TTS parts don't seem to work with it yet:

[2025-05-05 20:53:15,808] [WARNING] [real_accelerator.py:194:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.

Error loading model for checkpoint ./models/Lasinya: This op had not been implemented on CPU backend.

dankwizard 450 days ago

I've given up trying to locally use LLMs on AMD

lhl 450 days ago

Basically anything llama.cpp (Vulkan backend) should work out of the box w/o much fuss (LM Studio, Ollama, etc).

The HIP backend can have a big prefill speed boost on some architectures (high-end RDNA3 for example). For everything else, I keep notes here: https://llm-tracker.info/howto/AMD-GPUs

peterldowns 450 days ago

Can you explain more about the "Coqui XTTS Lasinya" models that the code is using? What are these, and how were they trained/finetuned? I'm assuming you're the one who uploaded them to huggingface, but there's no model card or README https://huggingface.co/KoljaB/XTTS_Models

In case it's not clear, I'm talking about the models referenced here. https://github.com/KoljaB/RealtimeVoiceChat/blob/main/code/a...

wkat4242 449 days ago

Yeah I really dislike the whisperiness of this voice "Lasinya". It sounds too much like an erotic phone service. I wonder if there's any alternative voice? I don't see Lasinya even mentioned in the public coqui models: https://github.com/coqui-ai/STT-models/releases . But I don't see a list of other model names I could use either.

I tried to select kokoro in the python module but it says in the logs that only coqui is available. I do have to say the coqui models sound really good, it's just the type of voice that puts me off.

The default prompt is also way too "girlfriendy" but that was easily fixed. But for the voice, I simply don't know what the other options are for this engine.

PS: Forgive my criticism of the default voice but I'm really impressed with the responsiveness of this. It really responds so fast. Thanks for making this!

koljab 448 days ago

Yeah, I know the voice is polarizing. I trained it for myself, so it's not an official release. You can change the voice here:

https://github.com/KoljaB/RealtimeVoiceChat/blob/main/code/a...

Create a subfolder in the app container: ./models/some_folder_name. Copy the files from your desired voice into that folder: config.json, model.pth, vocab.json and speakers_xtts.pth (you can copy the speakers_xtts.pth from Lasinya, it's the same for every voice).

Then change the specific_model="Lasinya" line in audio_module.py into specific_model="some_folder_name".
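A quick way to check a voice folder before restarting is sketched below. It only verifies that the four files named above exist; the helper name `missing_voice_files` is hypothetical and not part of the repo.

```python
from pathlib import Path

# File names listed in the steps above.
REQUIRED_FILES = ("config.json", "model.pth", "vocab.json", "speakers_xtts.pth")


def missing_voice_files(models_dir: Path, voice_name: str) -> list[str]:
    """Return the required files that are absent from models_dir/voice_name."""
    voice_dir = models_dir / voice_name
    return [name for name in REQUIRED_FILES if not (voice_dir / name).is_file()]


if __name__ == "__main__":
    # "some_folder_name" is the placeholder folder name used above.
    missing = missing_voice_files(Path("./models"), "some_folder_name")
    print("missing:", missing or "nothing")
```

If the list comes back empty, the folder at least has the expected layout; whether the files themselves load is a separate question.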

If you change TTS_START_ENGINE to "kokoro" in server.py, it's supposed to work. What happens then? Can you post the log message?

wkat4242 448 days ago

Thank you!

I didn't realise that you custom-made that voice. Would you have some links to other out-of-the-box voices for coqui? I'm having some trouble finding them. I think from seeing the demo page that the idea is that you clone someone else's voice or something with that engine. Because I don't see any voices listed. I've never seen it before.

And yes I switched to Kokoro now, I thought it was the default already but then I saw there were 3 lines configuring the same thing. So that's working. Kokoro isn't quite as good though as coqui, that's why I'm wondering about that. I also used kokoro on openwebui and I wasn't very happy with it there either. It's fast, but some pronunciation is weird. Also, it would be amazing to have bilingual TTS (English/Spanish in my case). And it looks like Coqui might be able to do that.

koljab 448 days ago

I haven't found many Coqui finetunes so far either. I have David Attenborough and Snoop Dogg finetunes on my huggingface; the quality is medium.

Coqui can do 17 languages. The problem with the RealtimeVoiceChat repo is turn detection: the model I use to determine whether a partial sentence indicates a turn change is trained on an English corpus only.

Buckaroo9 450 days ago

https://huggingface.co/coqui/XTTS-v2

optimog 450 days ago

Seems like they are out of business. Their homepage mentions "Coqui is shutting down."* That is probably the reason you can't find that much.

*https://coqui.ai/

koljab 448 days ago

The Lasinya voice is an XTTS 2.0.2 finetune I made with a self-created, synthesized dataset. I used https://github.com/daswer123/xtts-finetune-webui for training.

dummydummy1234 450 days ago

Have you looked at pipecat? It seems similar, trying to build standardized backend/WebRTC turn detection pipelines.

koljab 449 days ago

Did not look into that one. Looks quite good, I will try that soon.

ivape 450 days ago

Would you say you are using the best-in-class speech to text libs at the moment? I feel like this space is moving fast because the last time I was headed down this track, I was sure whisper-cpp was the best.

koljab 450 days ago

I'm not sure tbh. Whisper has been king for a long time now, especially with the CTranslate2 implementation from faster_whisper. But NVIDIA open sourced Parakeet TDT today and it instantly went to No. 1 on the Open ASR Leaderboard. Will have to evaluate these latest models, they look strong.

kristopolous 450 days ago

https://yummy-fir-7a4.notion.site/dia is the new hotness.

koljab 450 days ago

Tried that one. Quality is great but sometimes generations fail and it's rather slow. Also needs ~13 GB of VRAM, it's not my first choice for voice agents tbh.

kristopolous 450 days ago

alright, dumb question.

(1) I assume these things can do multiple languages

(2) Given (1), can you strip all the languages you aren't using and speed things up?

koljab 450 days ago

Actually good question.

I'd say probably not. You can't easily "unlearn" things from the model weights (and even if you could, that alone wouldn't help). You could retrain/finetune the model heavily on a single language, but again, that alone does not speed up inference.

To gain speed you'd have to bring the parameter count down and train the model from scratch on a single language only. That might work, but it's also quite probable that it introduces other issues in the synthesis. In a perfect world the model would use all those "free parameters", no longer needed for other languages, for a better synthesis of the single trained language. That might be true to a certain degree, but it's not exactly how AI parameter scaling works.

oezi 450 days ago

Parakeet is English only. Stick with Whisper.

The core innovation is happening in TTS at the moment.

ivape 450 days ago

Yeah, I figured you would know. Thanks for that, bookmarking that asr leaderboard.

riquito 450 days ago

Very cool, thanks for sharing.

A couple of questions:

- Any thoughts about wake word engines, to have something that listens without consuming resources all the time? The landscape for open solutions doesn't seem good.
- Any plans to allow using external services for STT/TTS for people who don't have a 4090 ready (at the cost of privacy and relying on SaaS providers)?

TeMPOraL 449 days ago

FWIW, wake words are a stopgap; if we want Star Trek-level voice interfaces, where the computer responds only when you actually meant to call it, as opposed to using the wake word as a normal word in the conversation, the computer needs to be constantly listening.

A good analogy here is to think of the computer (assistant) as another person in the room, busy with their own stuff but paying attention to the conversations happening around them, in case someone suddenly requests their assistance.

This, of course, could be handled by a more lightweight LLM running locally and listening for explicit mentions/addressing the computer/assistant, as opposed to some context-free wake words.

Dr4kn 449 days ago

Home Assistant is much nearer to this than other solutions.

You have a wake word, but it can also speak to you based on automations. You come home and it could tell you that the milk is empty, but with a holiday coming up you probably should go shopping.

Dlemo 449 days ago

I want that for privacy reasons and for resource reasons.

And having this as a small hardware device should not add relevant latency to it.

jillyboel 449 days ago

Privacy isn't a concern when everything is local

Dlemo 449 days ago

Yes it is.

Malware, bugs etc can happen.

And I also might not want to disable it for every guest either.

ben_w 449 days ago

If the AI is local, it doesn't need to be on an internet connected device. At that point, malware and bugs in that stack don't add extra privacy risks* — but malware and bugs in all your other devices with microphones etc. remain a risk, even if the LLM is absolutely perfect by whatever standard that means for you.

* unless you put the AI on a robot body, but that's then your own new and exciting problem.

jillyboel 449 days ago

There is no privacy difference between a local LLM listening versus a local wake word model listening.

koljab 449 days ago

That would be quite easy to integrate. RealtimeSTT already has wake word support for both pvporcupine and openWakeWord.

justlikereddit 449 days ago

Modify it with an ultra-light LLM agent that always listens and uses a wake word to agentically call the paid API?

Dr4kn 449 days ago

You could use openWakeWord, which Home Assistant developed for its own voice assistant.

supermatt 449 days ago

It was developed by David Scripka: https://github.com/dscripka/openWakeWord

jokethrowaway 449 days ago

Neat!

I built something almost identical last week (closed source, not my IP) and I recommend: NeMo Parakeet (even faster than insanely_fast_whisper), F5-TTS (fast + very good quality voice cloning), Qwen3-4B for LLM (amazing quality).

pzo 449 days ago

This looks great, will definitely have a look. I'm just wondering if you tested fastRTC from Hugging Face? I haven't done that myself, and I'm curious about the speed of this vs fastRTC vs pipecat.

koljab 449 days ago

Yes, I tested it. I'm not that sure what they created there. It adds some noticeable latency compared to using raw websockets. Imho it's not supposed to, but it did nevertheless in my tests.

karimf 450 days ago

Do you have any information on how long each step takes? Like how many ms for each step of the pipeline?

I'm curious how fast it will run if we can get this running on a Mac. Any ballpark guess?

koljab 448 days ago

LLM and TTS latency gets determined and logged at the start. It's around 220 ms for the LLM to return the first synthesizable sentence fragment (depending on the length of the fragment, which is usually between 3 and 10 words). Then around 80 ms of TTS until the first audio chunk is delivered. STT with base.en you can neglect, it's under 5 ms; VAD is the same. The turn detection model also adds around 20 ms. I have zero clue if and how fast this runs on a Mac.
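Adding those figures up gives a rough per-turn budget. The sketch below assumes the stages run strictly one after another and treats "under 5 ms" as 5 ms, so it is an estimate from the quoted numbers rather than a measurement; the dictionary keys are just labels chosen here.

```python
# Per-stage figures quoted above, in milliseconds.
# "under 5 ms" is counted as 5 ms, so the total leans toward the high side.
STAGE_MS = {
    "vad": 5,
    "stt_base_en": 5,
    "turn_detection": 20,
    "llm_first_fragment": 220,
    "tts_first_chunk": 80,
}

total_ms = sum(STAGE_MS.values())
for stage, ms in STAGE_MS.items():
    print(f"{stage:>20}: {ms:4d} ms ({ms / total_ms:5.1%})")
print(f"{'total':>20}: {total_ms:4d} ms")
```

The listed stages sum to 330 ms, with the LLM's first fragment taking about two thirds of it. That sits below the roughly 500 ms figure from the post; what accounts for the difference is not stated here.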

tmaly 449 days ago

What is the min VRAM needed on the GPU to run this? I did not see that on the github

koljab 448 days ago

With the current 24B LLM it's 24 GB. I have no clue how far down you can go on GPU memory by using smaller models; you can set the model in server.py. I'm quite sure 16 GB will work, but at some point it will probably fail.
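For a rough feel of how model size maps to memory, here is a back-of-the-envelope sketch. It covers the quantized weights only and assumes about 4.85 bits per weight for Q4_K_M, which is an approximation; the KV cache, the STT/TTS models and runtime overhead come on top, and the smaller model sizes are arbitrary examples.

```python
def estimate_weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GiB (weights only)."""
    return n_params * bits_per_weight / 8 / 2**30


# Assumed values: nominal parameter counts and ~4.85 bits/weight for Q4_K_M.
Q4_K_M_BITS = 4.85
for name, params in [("24B", 24e9), ("12B", 12e9), ("8B", 8e9)]:
    gib = estimate_weight_gib(params, Q4_K_M_BITS)
    print(f"{name}: ~{gib:.1f} GiB for weights")
```

Under these assumptions the 24B weights alone come to somewhere between 13 and 14 GiB, which is why a smaller LLM is the obvious lever on a card with less memory.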

dotancohen 450 days ago

This looks great. What hardware do you use, or have you tested it on?

koljab 450 days ago

I only tested it on my 4090 so far

echelon 450 days ago

Are you using all local models, or does it also use cloud inference? Proprietary models?

Which models are running in which places?

Cool utility!

koljab 450 days ago

All local models:

- VAD: Webrtcvad (first fast check) followed by SileroVAD (high compute verification)
- Transcription: base.en whisper (CTranslate2)
- Turn Detection: KoljaB/SentenceFinishedClassification (selftrained BERT-model)
- LLM: hf.co/bartowski/huihui-ai_Mistral-Small-24B-Instruct-2501-abliterated-GGUF:Q4_K_M (easily switchable)
- TTS: Coqui XTTSv2, switchable to Kokoro or Orpheus (this one is slower)
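The VAD entry describes a two-stage cascade: a cheap check first, and the costly verification only for frames that pass it. A generic sketch of that pattern follows. Both detectors here are hypothetical energy-threshold stand-ins, not the Webrtcvad or SileroVAD APIs, and the thresholds are made up.

```python
from typing import Callable, Sequence

Frame = Sequence[float]  # one audio frame as float samples in [-1, 1]


def mean_energy(frame: Frame) -> float:
    """Mean squared amplitude of a frame (0.0 for an empty frame)."""
    return sum(s * s for s in frame) / len(frame) if frame else 0.0


def make_cascade(cheap: Callable[[Frame], bool],
                 expensive: Callable[[Frame], bool]) -> Callable[[Frame], bool]:
    """Run the expensive check only on frames the cheap check lets through."""
    def is_speech(frame: Frame) -> bool:
        return cheap(frame) and expensive(frame)
    return is_speech


# Hypothetical stand-ins for the fast and the high-compute detector.
calls = {"expensive": 0}


def cheap_check(frame: Frame) -> bool:
    return mean_energy(frame) > 0.01


def expensive_check(frame: Frame) -> bool:
    calls["expensive"] += 1  # count how often the costly path runs
    return mean_energy(frame) > 0.05


is_speech = make_cascade(cheap_check, expensive_check)
silence, hum, voice = [0.0] * 160, [0.15] * 160, [0.5] * 160
results = [is_speech(f) for f in (silence, hum, voice)]
print(results, "expensive calls:", calls["expensive"])
```

The point of the ordering is that silent frames never reach the expensive detector, so its cost is only paid when something plausibly speech-like shows up.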

echelon 450 days ago

That's excellent. Really amazing bringing all of these together like this.

Hopefully we get an open weights version of Sesame [1] soon. Keep watching for it, because that'd make a killer addition to your app.

[1] https://www.sesame.com/

koljab 448 days ago

That would be absolutely awesome. But I doubt it, since they released a shitty version of that amazing thing they put online. I feel they aren't planning to give us their top model soon.