


	by mring33621 415 days ago
	the long 'uuuuhhhhhhh' from some of the lesser models is killing me.

3 comments

gapeleon 414 days ago

This finetune seems pretty stable (1b llasa) https://huggingface.co/spaces/HKUST-Audio/Llasa-1B-multi-spe...

1B is actually huge for a TTS model. Here's an 82M model with probably the most stable/coherent output of all the open-weights TTS models I've tested: https://huggingface.co/spaces/hexgrad/Kokoro-TTS

But if you mean zero-shot cloning, yeah they all seem to have those slurred speech artefacts from time to time.


jszymborski 415 days ago

Based on the samples, it really seems like anything smaller than 3B is pretty useless.


hadlock 415 days ago

If you're doing a home lab voice assistant, 1B is nice, because on a 12 GB GPU you can run a moderately competent 7B LLM and two 1B models, one for speech-to-text and one for text-to-speech, plus some memory for the wake word monitor. Maybe in a couple of years we can combine all this into a single ~8B model that runs efficiently on a 12 GB GPU. Nvidia doesn't seem very incentivized right now to sell consumer GPUs that can run all this on a single consumer-grade chip when they're making so much money selling commercial-grade 48 GB cards.
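
A rough back-of-the-envelope check of that memory budget, as a sketch only. The precisions are assumptions not stated in the comment (7B LLM quantized to 4 bits per weight, both 1B models at 16 bits), and the figure covers weights alone, ignoring KV cache, activations, and runtime overhead. The helper name `weight_gb` is made up for illustration.

```python
def weight_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    # params_billions * 1e9 weights * (bits / 8) bytes each, divided by 1e9
    return params_billions * bits_per_weight / 8


# Assumed precisions (not from the thread): 4-bit LLM, fp16 speech models.
models = {
    "llm_7b_4bit": weight_gb(7, 4),
    "stt_1b_fp16": weight_gb(1, 16),
    "tts_1b_fp16": weight_gb(1, 16),
}

gpu_gb = 12
total = sum(models.values())
headroom = gpu_gb - total  # left for KV cache, activations, wake word model

for name, gb in models.items():
    print(f"{name}: {gb:.1f} GB")
print(f"total weights: {total:.1f} GB, headroom: {headroom:.1f} GB")
```

Under these assumptions the weights fit with room to spare; at 16 bits the 7B model alone would need about 14 GB and would not fit, so the quantization choice is what makes the setup plausible.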


Dlemo 414 days ago

Huh, for the activation word?

Shouldn't there be some hardware module available, similar to how Alexa, Siri and Google do it?

With a ring buffer detecting the word without recording everything?
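
The ring buffer idea can be sketched in software like this. It is a minimal illustration, not how any particular assistant implements it: `listen` and the `detect` callback are hypothetical names, and a real detector would be a small wake word model rather than the toy check used here. Only the last `window` frames are ever held in memory; older audio is discarded as new frames arrive.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Optional

Frame = bytes  # one short chunk of audio samples


def listen(
    frames: Iterable[Frame],
    detect: Callable[[List[Frame]], bool],  # hypothetical wake word detector
    window: int = 50,
) -> Optional[List[Frame]]:
    """Return the buffered frames once the wake word is detected, else None."""
    ring: Deque[Frame] = deque(maxlen=window)  # fixed-size ring buffer
    for frame in frames:
        ring.append(frame)  # the oldest frame is dropped automatically when full
        if len(ring) == window and detect(list(ring)):
            return list(ring)
    return None


# Toy usage: pretend the wake word "ends" at the frame whose byte value is 7.
fake_audio = [bytes([i]) for i in range(10)]
hit = listen(fake_audio, lambda w: w[-1] == bytes([7]), window=3)
print(hit)
```

Nothing outside the window is retained, which is the privacy property the question is getting at: audio before the wake word is overwritten rather than recorded.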


nialv7 414 days ago

the mispronunciation of 行 and 行 in the Chinese sample is killing me too XD
