by modeless 513 days ago
Why would you pirate a TTS service when there are so many great options for local open source TTS now? Models like Fish and Kokoro and StyleTTSv2 are great and very fast.

Click the leaderboard tab here: https://huggingface.co/spaces/TTS-AGI/TTS-Arena


The models you shared support only the top ~10 languages, or are English-only.

I believe the Edge API supports more models:

https://gist.github.com/BettyJJ/17cbaa1de96235a7f5773b8690a2...

Do you know of any commercially licensed TTS that supports 50+ languages and is relatively small (e.g. many small models, not 1 big model)? Meta's open models support like 300 languages, but the license doesn't permit commercial use :-/

I have been experimenting with Piper TTS recently. It's free, open source, and fast, and it has a lot of voices in different languages. The quality is not the best, but it's still good enough for most cases.

https://rhasspy.github.io/piper-samples/

For my native language, Norwegian, Piper TTS is at best "usable", and sometimes a fair bit worse than that. At least in its default form[1].

The rhythm and timing especially are often very jarring, making words difficult to understand, particularly when the pitch is not quite right.

It also doesn't seem to know about pacing, ignoring semicolons and commas.

Combined, I often need to think hard about what it just said, or even listen to it again.

I also notice these issues in the various English voice models to varying degrees, so it seems to be an inherent problem. Or can it be improved significantly by training it yourself?

[1]: https://rhasspy.github.io/piper-samples/

I don’t know about Norwegian but I wonder if the issues are due to the training data.

I’m sure it’s possible to train new voices.

The English voices are hit or miss, but some voices have up to 900 speakers, so it should be possible to find a nice voice in the haystack.

The thing I like about piper is it is so fast. I set it up to stream the output to VLC and it starts speaking in less than a second even on my laptop.

I wish it could have ElevenLabs quality, but right now speed is the most important factor for what I’m doing with it.
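
For anyone wanting to try the streaming setup described above, here is a rough sketch, not the commenter's actual configuration. It pipes Piper's raw PCM output straight into a player so speech can start before synthesis finishes. It swaps VLC for `aplay` (ALSA), since that is the pairing shown in Piper's own README. The voice file name and the 22050 Hz sample rate are assumptions; check the `.onnx.json` file that ships with your voice.

```python
import shutil
import subprocess

# Assumed placeholders: substitute your own voice file and its sample rate.
MODEL = "en_US-lessac-medium.onnx"
SAMPLE_RATE = 22050


def piper_cmd(model: str) -> list[str]:
    """Command that reads text on stdin and writes raw 16-bit PCM to stdout."""
    return ["piper", "--model", model, "--output-raw"]


def player_cmd(sample_rate: int) -> list[str]:
    """Command that plays raw 16-bit little-endian PCM read from stdin."""
    return ["aplay", "-r", str(sample_rate), "-f", "S16_LE", "-t", "raw", "-"]


def speak(text: str, model: str = MODEL, sample_rate: int = SAMPLE_RATE) -> None:
    """Pipe Piper's raw audio into the player so playback can start early."""
    if shutil.which("piper") is None or shutil.which("aplay") is None:
        raise RuntimeError("piper and aplay must both be on PATH")
    tts = subprocess.Popen(
        piper_cmd(model), stdin=subprocess.PIPE, stdout=subprocess.PIPE
    )
    player = subprocess.Popen(player_cmd(sample_rate), stdin=tts.stdout)
    tts.stdout.close()  # the player now owns the read end of the pipe
    tts.stdin.write(text.encode("utf-8") + b"\n")
    tts.stdin.close()  # signals end of input to Piper
    player.wait()
    tts.wait()
```

Calling `speak("Hello from Piper.")` should play the sentence, provided both binaries are installed and the voice file exists in the working directory.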

I saw that the piper-phonemize project links to espeak-ng, so I tried passing the Piper sample text through espeak-ng, and the way it phonemized the text had the same rhythm issues that I noted in the TTS sample. I.e., it put the stresses in the same wrong places in certain words and such.

This was also reflected in the voice output of espeak-ng, even though its overall quality was vastly subpar compared to Piper TTS (as expected).

So it seems that improving this aspect might be one way to get better performance out of Piper for my language. Not sure how easy that'll be tho...
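
A minimal sketch of the check described above, for anyone who wants to repeat it: it asks the espeak-ng command-line tool for the IPA transcription of a sentence, where stress shows up as `ˈ` and `ˌ` marks. The helper names are made up for illustration, and `nb` (Norwegian Bokmål) is assumed as the voice. piper-phonemize uses espeak-ng as a library, so the CLI output may not match exactly what Piper sees.

```python
import shutil
import subprocess


def phonemize_cmd(text: str, voice: str = "nb") -> list[str]:
    """espeak-ng command that prints IPA phonemes without playing audio."""
    # -q: suppress audio, --ipa: print IPA phonemes, -v: voice/language code
    return ["espeak-ng", "-q", "--ipa", "-v", voice, text]


def phonemize(text: str, voice: str = "nb") -> str:
    """Run espeak-ng and return its IPA transcription of the text."""
    if shutil.which("espeak-ng") is None:
        raise RuntimeError("espeak-ng must be on PATH")
    result = subprocess.run(
        phonemize_cmd(text, voice), capture_output=True, text=True, check=True
    )
    return result.stdout.strip()
```

Comparing the stress marks in the output against how a native speaker would say the sentence is one way to tell whether a rhythm problem comes from the phonemizer rather than the voice model.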

What TTS model has given the best results for you (for Norwegian)? I've tried MS Azure and it's pretty good, but not flawless.
I haven't found any open source models that come close to the commercial offerings, though I admit I haven't tried 'em all.

Azure, like you say, is pretty decent. Google does an OK job, but not as good.

Piper is superb for my needs. It runs extremely fast on CPU (so fast it can run in real time on a raspi), so it's perfect for use on laptops without dedicated GPUs. Subjectively, I'd say the quality is about on par with where macOS's TTS was about 10 years ago, which is extremely usable.
I also have used Piper and agree it is worth trying out.
https://ttsvoicesavailable.streamlit.app

Acapela, Nuance - but it's around 75 languages.

I really want Southeast Asian languages (Thai, Lao, etc.). It seems only MS supports those.
Isn't that Nuance product EOL?
I don't know, but the Edge API is not licensed for any use, commercial or otherwise (outside of Edge itself).
"pirate"? This was always free.
The API endpoint was clearly intended for use only by Edge. Yes, reverse engineering the authentication (even if trivial) and using it for other applications, knowing that was not its intended use, I consider a form of piracy.
I'm not really sure how this is any different from a web crawler. I guess the issue would be republishing the content.

But I thought the LinkedIn lawsuit settled that crawlers are ok, as long as you're not republishing the content?

That is a very hazardous slope to go down. We are already seeing user-agent discrimination and this is no different than using Bing from a browser that isn't Edge.
If Bing weren't a public website and were only accessible through the Windows search bar/Edge without reverse engineering the API, I'd agree with you.

Comparing an API that typically requires a key and a public website is absurd.

It's still publicly accessible.
Typing anything with “r” into that text-to-speech box gives a random sentence instead.
Is Kokoro open source? I couldn't find its source anywhere.