by mring33621 415 days ago
the long 'uuuuhhhhhhh' from some of the lesser models is killing me.
3 comments

This finetune seems pretty stable (Llasa 1B): https://huggingface.co/spaces/HKUST-Audio/Llasa-1B-multi-spe...

1B is actually huge for a TTS model. Here's an 82M model with probably the most stable/coherent output of all the open-weights TTS models I've tested: https://huggingface.co/spaces/hexgrad/Kokoro-TTS

But if you mean zero-shot cloning, yeah, they all seem to have those slurred speech artefacts from time to time.

based on the samples, it really seems like anything smaller than 3B is pretty useless.
If you're doing a home lab voice assistant, 1B is nice, because on a 12 GB GPU you can run a moderately competent 7B LLM and two 1B models: one for speech-to-text and one for text-to-speech, with some room left for the wake-word monitor. Maybe in a couple of years we can combine all this into a single ~8B model that runs efficiently on a 12 GB GPU. Nvidia doesn't seem very incentivized right now to sell consumer GPUs that can run all this on a single consumer-grade chip when they're making so much money selling commercial-grade 48 GB cards.
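
For a rough sanity check on that 12 GB budget, here's a back-of-the-envelope sketch. The quantization levels are my assumptions, not something stated above: the 7B LLM at 4 bits per weight and the two 1B speech models at 16 bits. It counts weights only and ignores KV cache, activations and runtime overhead, so treat it as a lower bound.

```python
def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 2**30


# Assumed setup (hypothetical): 4-bit LLM, 16-bit speech models.
budget = {
    "7B LLM @ 4-bit": weights_gib(7, 4),
    "1B speech-to-text @ 16-bit": weights_gib(1, 16),
    "1B text-to-speech @ 16-bit": weights_gib(1, 16),
}
total = sum(budget.values())

for name, gib in budget.items():
    print(f"{name:28s} {gib:5.2f} GiB")
print(f"{'total (weights only)':28s} {total:5.2f} GiB")
```

Under those assumptions the weights come to roughly 7 GiB, which leaves a few GiB on a 12 GB card for the KV cache, activations and a small wake-word model. At 8 bits for the LLM the margin gets much tighter.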
Hui for the activation word?

Shouldn't there be some hardware module available, similar to how Alexa, Siri and Google do it?

With a ring buffer detecting the word without recording everything?
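
To make the ring buffer idea concrete, here's a minimal sketch in software. It's not how Alexa, Siri or Google actually implement it. `listen_for_wake_word`, `detector` and `buffer_frames` are hypothetical names, and the detector is a stand-in for a real wake-word model. The point is only that a fixed-size buffer keeps the last few frames and silently drops everything older, so nothing is recorded until the detector fires.

```python
from collections import deque
from typing import Callable, Iterable, List, Optional, Sequence

Frame = Sequence[float]  # one short chunk of audio samples


def listen_for_wake_word(
    frames: Iterable[Frame],
    detector: Callable[[Sequence[Frame]], bool],  # hypothetical wake-word model
    buffer_frames: int = 50,
) -> Optional[List[Frame]]:
    """Return the buffered audio once the detector fires, else None."""
    # deque with maxlen acts as a ring buffer: appending to a full
    # buffer discards the oldest frame automatically.
    ring: deque = deque(maxlen=buffer_frames)
    for frame in frames:
        ring.append(frame)
        if detector(ring):
            return list(ring)  # only now does any audio leave the buffer
    return None


# Toy demo: "wake word" is just a loud frame after 100 silent ones.
silence = [[0.0] for _ in range(100)]
stream = silence + [[1.0]]
captured = listen_for_wake_word(stream, lambda ring: ring[-1][0] > 0.5)
```

In the toy demo, only the last 50 frames are still in memory when the detector fires. The earlier 51 silent frames were already overwritten.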

the mispronunciation of 行 and 行 in the Chinese sample is killing me too XD