| HN Mirror



	by nowittyusername 148 days ago
	I have been playing around with over 10 stt systems in the last 25 days, and it's really weird to read this article as my experience is the opposite. Stt models are amazing today. They are stupid fast, sound great, and are very simple to implement, as Hugging Face Spaces code is readily available for any model. What's funny is that the model he was talking about, "supertonic", was exactly the model I would have recommended if people wanted to see how amazing the tech has become. The model is tiny, runs 55x real time on any potato and sounds amazing. Also, I think he is implementing his models wrong. He mentions that some models don't have streaming and you have to wait for the whole chunk to be processed. But that's not a limit in any meaningful way, as you can define the chunk. You can simply make the first n characters within the first sentence be the chunk, process that first, and play it immediately while the rest of the text is being processed. ttfs and ttfa on all modern-day models are well below 0.5, and for supertonic it was 0.05 in my tests.
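
The chunking idea in the comment above can be sketched roughly as follows. This is a minimal illustration, not the commenter's actual code: the function name `chunk_for_low_latency` and the `first_chunk_chars` parameter are hypothetical, and the sentence splitting is a simple regex rather than anything a particular TTS library provides.

```python
import re
from typing import Iterator


def chunk_for_low_latency(text: str, first_chunk_chars: int = 40) -> Iterator[str]:
    """Yield a short first chunk, then the rest of the text sentence by sentence.

    The first chunk is at most `first_chunk_chars` characters and never runs
    past the end of the first sentence, so a non-streaming model can return
    audio for it quickly while later chunks are still being synthesized.
    """
    text = text.strip()
    if not text:
        return

    # Find where the first sentence ends (or use the whole text if no terminator).
    match = re.search(r"[.!?](?:\s|$)", text)
    first_sentence_end = match.end() if match else len(text)

    if first_sentence_end <= first_chunk_chars:
        # The whole first sentence is short enough to be the first chunk.
        cut = first_sentence_end
    else:
        # Cut at the last space at or before the limit so no word is split.
        cut = text.rfind(" ", 0, first_chunk_chars + 1)
        if cut <= 0:
            cut = min(first_chunk_chars, first_sentence_end)

    first, rest = text[:cut].strip(), text[cut:].strip()
    if first:
        yield first

    # Remaining text: one chunk per sentence.
    for sentence in re.split(r"(?<=[.!?])\s+", rest):
        if sentence:
            yield sentence
```

Each yielded chunk would be passed to whatever synthesis call the chosen model exposes, with the first chunk's audio played as soon as it is ready and the remaining chunks synthesized in the background.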

6 comments

jdp23 148 days ago

What screenreaders are you using to test the models with?


cachius 148 days ago

What's your experience at high speeds, with garbled speech artifacts and pronunciation accuracy?


nowittyusername 148 days ago

With supertonic, or overall? If overall, most do pretty well, though some are funky. suprano was so bad no matter what I did that I had to rule it out from my top contenders for anything. supertonic was close to my number one choice for my agentic pipeline as it was so insanely fast and the quality was great, but it didn't have the other bells and whistles that some other models have, so I held it back for CPU-only projects in the future. If you are gonna use a GPU, I would suggest chatterbox or pocket tts. Chatterbox is my top contender as of now because it sounds amazing, has cloning, and I got it down to 0.26 ttfa/ttsa once I quantized it and implemented pipecat into it. pocket tts is probably my second choice for similar reasons.
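
For anyone wanting to reproduce latency figures like these, here is a rough timing sketch. It reads "ttfa" as time to first audio, which the thread does not spell out, and it uses a stand-in `fake_synthesize` function that only sleeps; both function names are hypothetical, and a real test would swap in the model's own synthesis call.

```python
import time
from typing import Callable, Iterable, Tuple


def time_to_first_audio(
    synthesize: Callable[[str], bytes], chunks: Iterable[str]
) -> Tuple[float, float]:
    """Return (seconds until the first chunk's audio is ready, total seconds)."""
    start = time.perf_counter()
    first = None
    for chunk in chunks:
        synthesize(chunk)
        if first is None:
            # Audio for the first chunk exists now, so playback could begin.
            first = time.perf_counter() - start
    total = time.perf_counter() - start
    return (first if first is not None else total), total


def fake_synthesize(text: str) -> bytes:
    """Stand-in for a real TTS call: sleeps in proportion to the text length."""
    time.sleep(0.001 * len(text))
    return b"\x00" * len(text)


ttfa, total = time_to_first_audio(
    fake_synthesize, ["short", "a much longer second chunk of text"]
)
print(f"time to first audio: {ttfa:.3f}s, total: {total:.3f}s")
```

With a short first chunk, the first number stays small even when the total synthesis time grows with the length of the text.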


pixl97 148 days ago

>Also I think he is implementing his models wrong.

This is something I've noticed around a lot of AI-related stuff. You really can't take any one article on it as definitive, and anything that doesn't publish how it was fully implemented is suspect. That goes for both affirmative and negative findings.

It reminds me a bit of the earlier days of the internet, where there was a lot of exploration of ideas occurring, but quite often the implementation and testing of those ideas left much to be desired.


swores 147 days ago

Minor nitpick, but you mean "tts" not "stt" both times.

Is supertonic the best sounding model, or is there a different one you'd recommend that doesn't perform as well but sounds even better?


nowittyusername 147 days ago

Yes, sorry, I mixed these up. supertonic is not the best sounding in my tests. It was by far the fastest, but its audio quality for something so fast was decent. If you want something that sounds better AND is also extremely fast, pocket tts is the choice: amazing quality and also crazy fast on both GPU and CPU. If you care mainly about quality, chatterbox was the best fit in my tests, but it's slower than the others. qwen 3 tts was also great, but it's unusable as any real-time agentic voice as it's too slow. They haven't released the code for streaming yet; once they release that, it will be my top contender.


swores 147 days ago

Thanks!


8bitsrule 148 days ago

Just found this video ... it looks to sound and work -very- well. (RasPI & Onyx)

https://www.youtube.com/watch?v=bZ3I76-oJsc


noosphr 148 days ago

Are you using them at 1000 wpm?


nowittyusername 148 days ago

Supertonic is probably way faster than that. I wouldn't be surprised if, measured, it would be something like 14k wpm. On my 4090 I was getting about 175x real time, while on CPU only it was 55x real time. I stopped optimizing it, but I'm sure it could be pushed further. Anyway, you should check out their repo to test it yourself; it's crazy what that team accomplished!
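
As a back-of-the-envelope check on those numbers, a real-time factor can be converted to a synthesis rate in words per minute. The 150 wpm baseline for normal-speed speech is an assumption for illustration, not a figure from the thread, and both function names are hypothetical.

```python
def synthesis_wpm(realtime_factor: float, speech_wpm: float = 150.0) -> float:
    """Words of normal-speed speech generated per minute of compute time.

    `speech_wpm` is an assumed conversational speaking rate.
    """
    return realtime_factor * speech_wpm


def required_realtime_factor(playback_wpm: float, speech_wpm: float = 150.0) -> float:
    """Real-time factor needed for synthesis to keep up with a playback rate."""
    return playback_wpm / speech_wpm


print(synthesis_wpm(55))               # CPU-only figure from the comment
print(synthesis_wpm(175))              # 4090 figure from the comment
print(required_realtime_factor(1000))  # keeping pace with 1000 wpm playback
```

Under that assumption, 55x real time works out to 8,250 wpm and 175x to 26,250 wpm, so the "14k wpm" guess sits between the two. This only measures how fast audio is generated; it says nothing about whether speech played back at 1000 wpm is understandable to a listener.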


gia_ferrari 148 days ago

Audio synthesis speed is one thing, but is the output _intelligible to a human_ at 1,000wpm? That's the sort of thing Eloquence is being used for, according to the article.


nowittyusername 148 days ago

TTS has no intelligence, bud. It's only something that transforms text to audio, and that is all we are talking about here. Neither the article nor anyone else was discussing the whole stt > llm > tts pipeline.


noosphr 148 days ago

https://www.merriam-webster.com/dictionary/intelligible


mrbukkake 147 days ago

Did you even read the article bud
