by archerx 513 days ago
I have been experimenting with Piper TTS recently. It's free, open source, fast, and has a lot of voices in different languages. The quality is not the best, but it's still good enough for most cases.

https://rhasspy.github.io/piper-samples/

3 comments

For my native language, Norwegian, Piper TTS is at best "usable", and sometimes a fair bit worse than that. At least in its default form[1].

The rhythm and timing in particular are often very jarring, which makes words difficult to understand, especially when the pitch is not quite right.

It also doesn't seem to know about pacing, ignoring semicolons and commas.

Combined, I often need to think hard about what it just said, or even listen to it again.

I also notice these issues in the various English voice models to varying degrees, so it seems to be an inherent problem. Or can it be improved significantly by training it yourself?

[1]: https://rhasspy.github.io/piper-samples/

I don’t know about Norwegian but I wonder if the issues are due to the training data.

I’m sure it’s possible to train new voices.

The English voices are hit or miss, but some voices have up to 900 speakers, so it should be possible to find a nice voice in the haystack.

The thing I like about piper is it is so fast. I set it up to stream the output to VLC and it starts speaking in less than a second even on my laptop.
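For anyone who wants to try a similar streaming setup, here is a minimal sketch in Python. It is not my exact setup: it assumes the `piper` binary and `aplay` are on your PATH, uses `aplay` in place of VLC, and the model filename in the usage comment is only an example. The 22050 Hz sample rate is an assumption too and depends on the voice you downloaded.

```python
import shutil
import subprocess


def build_pipeline(model_path, sample_rate=22050):
    """Return (piper_cmd, player_cmd) for streaming raw audio to a player."""
    # --output-raw makes piper write raw PCM to stdout as it is synthesized.
    piper_cmd = ["piper", "--model", model_path, "--output-raw"]
    # aplay stands in for VLC here; it reads 16-bit little-endian PCM from stdin.
    # The sample rate must match the voice model (assumed 22050 Hz by default).
    player_cmd = ["aplay", "-r", str(sample_rate), "-f", "S16_LE", "-t", "raw", "-"]
    return piper_cmd, player_cmd


def speak(text, model_path, sample_rate=22050):
    """Pipe text into piper and play the audio while it is still being generated."""
    piper_cmd, player_cmd = build_pipeline(model_path, sample_rate)
    if not (shutil.which(piper_cmd[0]) and shutil.which(player_cmd[0])):
        raise RuntimeError("piper and aplay must both be on PATH")
    piper = subprocess.Popen(piper_cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    player = subprocess.Popen(player_cmd, stdin=piper.stdout)
    # Close our copy of the pipe so piper gets SIGPIPE if the player exits early.
    piper.stdout.close()
    piper.stdin.write(text.encode("utf-8") + b"\n")
    piper.stdin.close()
    player.wait()
    piper.wait()


# Example call (model path is hypothetical):
# speak("This sentence is spoken as soon as it is ready.", "en_US-lessac-medium.onnx")
```

Because the audio is played as it arrives rather than written to a file first, the delay before the first sound depends mostly on how quickly the first sentence is synthesized.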

I wish it had ElevenLabs quality, but right now speed is the most important factor for what I’m doing with it.

I saw that the piper-phonemize project links to espeak-ng, so I passed the Piper sample text through espeak-ng, and the way it phonemized the text had the same rhythm issues that I noted in the TTS sample. I.e. it put the stresses in the same wrong places in certain words and such.

This was also reflected in the voice output of espeak-ng, even though its overall quality was vastly subpar compared to Piper TTS (as expected).

So it seems that improving this aspect might be one way to get better performance out of Piper for my language. Not sure how easy that'll be tho...
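If anyone wants to repeat the check above, a rough sketch of it in Python follows. It assumes `espeak-ng` is installed and on your PATH, and that `nb` (Norwegian Bokmål) is the voice you want. I have not verified that Piper calls espeak-ng with exactly these settings, so treat the output as an approximation of what Piper sees.

```python
import shutil
import subprocess


def phonemize_cmd(text, voice="nb"):
    """Build the espeak-ng command that prints IPA phonemes without speaking."""
    # -q suppresses audio output; --ipa prints the phonemes in IPA to stdout.
    return ["espeak-ng", "-v", voice, "-q", "--ipa", text]


def phonemize(text, voice="nb"):
    """Return the IPA string espeak-ng produces for the given text."""
    if shutil.which("espeak-ng") is None:
        raise RuntimeError("espeak-ng not found on PATH")
    result = subprocess.run(
        phonemize_cmd(text, voice), capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


# Example call (the sentence is just an illustration):
# print(phonemize("Dette er en test av norsk tale."))
```

In the IPA output, primary stress is marked with ˈ in front of the stressed syllable, so a misplaced stress shows up as that mark sitting on the wrong syllable.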

What TTS model has given the best results for you (for Norwegian)? I've tried MS Azure and it's pretty good, but not flawless.
I haven't found any open source option that comes close to the commercial offerings, though I admit I haven't tried 'em all.

Azure, like you say, is pretty decent. Google does an OK job, but it's not as good.

Piper is superb for my needs. It runs extremely fast on CPU (so fast it can run in real time on a Raspberry Pi), so it's perfect for use on laptops without dedicated GPUs. Subjectively, I'd say the quality is about on par with where macOS's TTS was about 10 years ago, which is extremely usable.
I have also used Piper and agree it is worth trying out.