

by ljclifford 92 days ago

actually the hardest part of a locally hosted voice assistant isn't the llm. it's making the tts tolerable to actually talk to every day.

the core issue is prosody: kokoro and piper are trained on read speech, but conversational responses have shorter breath groups and different stress patterns on function words. that's why numbers, addresses, and hedged phrases sound off even when everything else works.

the fix is training data composition. conversational and read speech have different prosody distributions and models don't generalize across them. for self-hosted, coqui xtts-v2 [1] is worth trying if you want more natural english output than kokoro.

btw i'm lily, cofounder of rime [2]. we're solving this for business voice agents at scale, not really the personal home assistant use case, but the underlying problem is the same.

[1] https://github.com/coqui-ai/TTS

[2] https://rime.ai

6 comments

bachittle 92 days ago

Coqui TTS is actually deprecated; the company shut down. My voice assistant runs on gpt-5.4 and opus 4.6 through the subsidized plans from Codex and Claude Code, and it uses STT and TTS from mlx-audio so those portions stay locally hosted: https://github.com/Blaizzy/mlx-audio

Here are the models I found work well:

- Qwen ASR and TTS are really good. Qwen ASR is faster than OpenAI Whisper on Apple Silicon from my tests. And the TTS model has voice cloning support so you can give it any voice you want. Qwen ASR is my default.

- Chatterbox Turbo also does voice cloning TTS and is more efficient to run than Qwen TTS. Chatterbox Turbo is my default.

- Kitten TTS is good as a small model, better than Kokoro

- Soprano TTS is surprisingly really good for a small model, but it has glitches that prevent it from being my default

But overall the mlx-audio library makes it really easy to try different models and see which ones I like.

alias_neo 91 days ago

Do you know which HA integration I would use if I want to try out Qwen 3 ASR in HA? Some screenshots in the OP reference Qwen 3 ASR for STT but I can't seem to find any reference to which integration I'd use.

quickthoughts 92 days ago

I've been working on the flip side of this with ASR models, but the problem space is the same: conversational/real-world data is needed. Whisper often misheard words I actually said and hallucinated all the time when I used technical jargon. The solution was to fine-tune Whisper on my own data. Hardest part imo was getting the actual data, which in turn got me to build listenr (https://github.com/rebreda/listenr). It's an always-on VAD-based audio dataset builder. Could be used for building conversational/real-world voice datasets for TTS models too?

After getting it working I got motivated to actually build out the full fine-tuning pipeline. I wrote a little post about it all https://quickthoughts.ca/posts/listenr-asr-training-data-pro...
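For readers unfamiliar with the idea, here is a minimal sketch of what a VAD-based segmenter does: cut a continuous recording into speech spans that can be saved as dataset clips. This is not listenr's implementation. It assumes a naive energy threshold in place of a real VAD model, and every name and default (`speech_segments`, `threshold`, `hangover_frames`) is hypothetical.

```python
import math


def frame_rms(samples, frame_len):
    """RMS energy of consecutive, non-overlapping frames."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(s * s for s in frame) / frame_len))
    return energies


def speech_segments(samples, sample_rate=16000, frame_ms=30,
                    threshold=0.02, hangover_frames=10):
    """Return (start_sample, end_sample) spans judged to contain speech.

    A frame counts as speech when its RMS energy reaches `threshold`
    (an assumed value for float samples in [-1, 1]). Up to
    `hangover_frames` quiet frames are tolerated inside a span so that
    short pauses do not split an utterance.
    """
    frame_len = sample_rate * frame_ms // 1000
    energies = frame_rms(samples, frame_len)
    segments = []
    start = None      # frame index where the current span began
    silence = 0       # consecutive quiet frames seen inside the span
    for i, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence > hangover_frames:
                end = i - silence + 1  # index of the first quiet frame
                segments.append((start * frame_len, end * frame_len))
                start = None
                silence = 0
    if start is not None:
        # Recording ended mid-span; trim any trailing quiet frames.
        end = len(energies) - silence
        segments.append((start * frame_len, end * frame_len))
    return segments
```

Each returned span could then be written out as its own clip and paired with a transcript. A real pipeline would likely swap the energy check for a trained VAD, since a fixed threshold breaks down with background noise.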

cptskippy 92 days ago

> actually the hardest part of a locally hosted voice assistant isn't the llm. it's making the tts tolerable to actually talk to every day.

I would argue that the hardest part is correctly recognizing that it's being addressed. 98% of my frustration with voice assistants is them not responding when spoken to. The other 2% is realizing I want them to stop talking.

cdcarter 92 days ago

80% of my home voice assistant requests really need no response other than an affirmative sound effect.

nickthegreek 92 days ago

100% agree. I don't want a "Yes", "Got it", "Will do" or, even worse, "I have turned on the Bedroom Light". I want a soft success ding or a low failure boop.

XorNot 92 days ago

Talk back is how you make sure what you asked for is what happens.

An affirmative beep but the light does not turn on means you have to guess what did.

kbelder 90 days ago

I turned on the new 'sassy' personality for Alexa. Now, if you ask it to "set a 5 minute alarm," half the time she'll go off on a short rant about how she must obviously not be good for anything but keeping track of time for us humans.

I haven't figured out how to set her personality to 'brief and succinct' for me, but 'sassy' for my wife.

colechristensen 92 days ago

Star Trek got it right. two beeps, "Low High" = yup, "High Low" = nope

TeMPOraL 91 days ago

Also "High High" == affirmative.
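If anyone wants to try the two-beep scheme, here is a rough sketch that synthesizes "low high" and "high low" earcons using only the Python standard library. The frequencies, durations, and function names are all assumptions picked for illustration, not anything from the show or from a particular assistant.

```python
import math
import struct
import wave

SAMPLE_RATE = 16000  # Hz, assumed; any rate your player supports works


def tone(freq_hz, duration_ms=120, volume=0.4):
    """One sine beep with a short linear fade in/out to avoid clicks."""
    n = SAMPLE_RATE * duration_ms // 1000
    fade = max(1, n // 10)
    samples = []
    for i in range(n):
        envelope = min(1.0, i / fade, (n - 1 - i) / fade)
        samples.append(
            volume * envelope * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
        )
    return samples


def earcon(kind, low_hz=660.0, high_hz=990.0, gap_ms=40):
    """'ack' is low then high, 'nack' is high then low."""
    gap = [0.0] * (SAMPLE_RATE * gap_ms // 1000)
    if kind == "ack":
        return tone(low_hz) + gap + tone(high_hz)
    if kind == "nack":
        return tone(high_hz) + gap + tone(low_hz)
    raise ValueError(f"unknown earcon kind: {kind!r}")


def write_wav(path, samples):
    """Write float samples in [-1, 1] as 16-bit mono PCM."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(SAMPLE_RATE)
        wf.writeframes(b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        ))
```

Calling `write_wav("ack.wav", earcon("ack"))` and `write_wav("nack.wav", earcon("nack"))` gives two short files that an assistant could play in place of a spoken confirmation.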

ffsm8 92 days ago

why would you want an audio notification for a light? it either turns on and it worked or it doesn't turn on. i see no value in having a ding or anything of the kind

if i imagine constant dinging whenever i enter a room and the motion sensor toggles the light in it, i'd go mad

bluGill 92 days ago

The biggest use for me is 'guests will be here soon, turn on the lights in front of the shed where they will park', then later, when they are gone, turn them off. I can't see the lights from the house, and the logical place for a switch isn't in the house. Where I can see the lights, a manual switch is better. I don't have most of my lights automated. The ones that are, are that way because I can't see them from where I'd want to check and control them.

adolph 92 days ago

Need an ack aside from the system since the response might take a few moments, maybe a "share and enjoy" in a voice that sounds like it is smiling.

m463 92 days ago

i thought it was specifically when using voice - ack/nack

but it might be preference... some people like clicky blue keys, some like silent red keys on their keyboard for example.

renewiltord 92 days ago

That’s what Google Home does. “Hey, Google, good night”. Beep response then turns off the lights, brings down the blinds etc. but if something is out of whack it talks. I find it convenient.

ericmcer 91 days ago

Seriously for audio conversations the LLM layer is fairly stable. Getting STT and TTS to be reliable has been a much bigger hurdle.

I hear the same phrases 10+ times in a day and the TTS stresses things a bit differently each time. It seems like an exceptionally hard problem. My dream of a super reliable [llm output stream -> streaming TTS endpoint -> webRTC audio stream] seems pretty much impossible at this point.

Is the goal to trick people into thinking it is a human or to create a high trust robot? I am hoping as voice agents get more sophisticated the stigma around "It's making me talk to a robot" lessens so we don't need to worry so much about convincing someone it is a real person.
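One piece of the [llm output stream -> streaming TTS endpoint] hop that can be made deterministic is how the token stream is cut into chunks before synthesis. Below is a minimal sketch, assuming the TTS endpoint takes one sentence at a time; the function name and the `min_chars` default are hypothetical, and the TTS and WebRTC stages are deliberately left out.

```python
import re
from typing import Iterable, Iterator

# Sentence-ending punctuation followed by whitespace. Requiring the
# whitespace keeps decimals like "3.5" together, at the cost of waiting
# one more token before a sentence is released.
_SENTENCE_END = re.compile(r"([.!?]+)(\s+)")


def sentences_from_tokens(tokens: Iterable[str],
                          min_chars: int = 12) -> Iterator[str]:
    """Group streamed LLM tokens into sentence-sized chunks for TTS.

    Chunks shorter than `min_chars` are held back and merged with the
    next sentence, since very short inputs tend to give a TTS model
    little context for prosody (an assumption, not a measured result).
    """
    buf = ""
    for tok in tokens:
        buf += tok
        while True:
            match = None
            for cand in _SENTENCE_END.finditer(buf):
                if cand.end(1) >= min_chars:
                    match = cand
                    break
            if match is None:
                break
            yield buf[:match.end(1)].strip()
            buf = buf[match.end():]
    tail = buf.strip()
    if tail:
        yield tail  # flush whatever is left when the stream ends
```

Each yielded chunk would be sent to the TTS endpoint as soon as it is available. Abbreviations such as "Dr. Smith" will still be split by this regex, so a real pipeline would need a smarter boundary check.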

buildsjets 92 days ago

Can you make it sound just like Titus Moody? I want to hear your voice assistant say "No sir, I don't hold with furniture that talks."

https://www.youtube.com/watch?v=BIjjDC3tFfU