Hacker News
by satvikpendem 996 days ago
What's the best open source text to speech? Eleven Labs and others are interesting but closed source. I want to use them mainly for audiobooks as I have a lot of ePubs and I'm just using the basic Google text to speech voices on my Android, via Moon+ Reader. It works fine but it's still more robotic than state of the art.
7 comments

POST-EDIT, CORRECTED ANSWER

I doubt it's currently actually "the best open source text to speech", but the answer I came up with when throwing a couple of hours at the problem some months ago was "ttsprech" [3].

Following the guide, it was pretty trivial to make the model render my sample text in about 100 English "voices" (many of which were similar to each other, and of varying quality). Sampling those, I got about 10 that were pretty "good". And maybe 6 that were the "best ones" (very natural, not annoying to listen to, actually sounded like a person by and large), and maybe 2 made the top (as in, a tossup for the most listenable, all factors considered).

IIRC, the license was free for noncommercial use only. I'm not sure exactly "how open source" they are, but it was simple to install the dependencies and write the basic Python to try it out; I had to write a for loop to try all the voices like I wanted. I ended up using something else for the project for other reasons, but this could still be a fairly good backup option for some use cases, IMO.
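
For anyone wanting to do the same kind of voice sweep, the for loop described above might look roughly like this. It's only a sketch and not ttsprech's or Silero's actual API: `synthesize` is a hypothetical callable standing in for whatever call the chosen library exposes, and the voice names, the WAV-bytes return type, and the file naming are all assumptions.

```python
from pathlib import Path
from typing import Callable, Iterable


def render_all_voices(
    text: str,
    voices: Iterable[str],
    synthesize: Callable[[str, str], bytes],
    out_dir: Path,
) -> list[Path]:
    """Render `text` once per voice and return the paths that were written.

    `synthesize(text, voice)` is a hypothetical stand-in for the real TTS
    call; it is assumed to return already-encoded WAV bytes.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written: list[Path] = []
    for voice in voices:
        try:
            audio = synthesize(text, voice)
        except Exception as exc:
            # One broken voice should not abort a sweep over ~100 voices.
            print(f"skipping {voice}: {exc}")
            continue
        path = out_dir / f"{voice}.wav"
        path.write_bytes(audio)
        written.append(path)
    return written
```

With one file per voice in a single directory, it's easy to listen through them and shortlist the handful worth keeping.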

PRE-EDIT, ERRONEOUS ANSWER

Same as above, but I had said "Silero" [0, 1, 2] originally, which I started trying out too, before switching to a third (less open) option.

  [0] https://github.com/snakers4/silero-models#text-to-speech
  [1] https://silero.ai
  [2] https://github.com/snakers4/silero-models#standalone-use
  [3] https://github.com/Grumbel/ttsprech#usage
For neutral-sounding, very fast/efficient voices, I find the Coqui TTS VITS models to be very good. For slower, more expressive voices or voice cloning, I think Coqui TTS XTTS is good (or you can look at mrq/tortoise-tts).

I'm still awaiting a StyleTTS2 implementation. The audio samples sound top notch: https://styletts2.github.io/

You're in luck, the code dropped 6 hours ago :) https://github.com/yl4579/StyleTTS2

Looks promising, I'm going to check it out too! MIT license, even! If it's fast enough for real time, it could be the new best option. The paper claims faster inference than VITS...

Ha awesome! I just checked the repo literally before I posted and it was still empty, thanks for the heads up, will give it a spin now.
Just a followup for those interested, inference implementation notes and comparison clip between StyleTTS2, TTS VITS, and XTTS: https://fediverse.randomfoo.net/notice/AaOgprU715gcT5GrZ2
Wow you got it working so fast! I'm still stuck in package manager hell trying to debug a million little issues.
In my post I link to my issue where I outline what I needed to do from a clean mamba env that might help.

PyTorch nightly (which I use for CUDA 12) doesn't work with Python 3.12, but if you stick with 3.11 or 3.10 you should be OK. The rest I installed without version numbers, which should be fine if you're on a clean venv. However, there's a bug in the Utils lib that requires a 1-line fix if you're trying to run inference (also linked). nltk was the only dependency not listed, so not bad compared to most code drops!
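
If you want to fail fast before fighting the package manager, a small guard along these lines can check the interpreter version first. This is a sketch only: the supported range is simply what's reported above for PyTorch nightly at the time, and `python_supported` is a made-up helper name, not part of StyleTTS2 or PyTorch.

```python
import sys

# Assumption taken from the comment above: PyTorch nightly worked on
# Python 3.10 and 3.11 but not on 3.12 at the time of writing.
SUPPORTED_VERSIONS = {(3, 10), (3, 11)}


def python_supported(version=None) -> bool:
    """Return True if the (major, minor) pair is in the reported-working set."""
    if version is None:
        version = sys.version_info
    major, minor = version[:2]
    return (major, minor) in SUPPORTED_VERSIONS


if __name__ == "__main__":
    if python_supported():
        print("Python version looks fine for this setup.")
    else:
        found = ".".join(str(part) for part in sys.version_info[:3])
        print(f"Python {found} is outside the reported-working range (3.10 or 3.11).")
```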

We bought the $300/month plan for a few months earlier this year... and you'd only get 40 hours of audio generation for that. It wasn't really sufficient to our needs.

How many audio books is 40 hours?

Also, while its voice cloning was truly amazing, every once in a while the voice would get a little nutty and sound like an insect just flew down their throat, or maybe they had an LSD flashback. Normal normal normal then it's some Bobcat Goldthwait skit. And if you dialed down that parameter (I think it's called stability?) then it goes monotone really quickly.

We're probably several years out from it being something people use personally for audio books.

> We bought the $300/month plan for a few months earlier this year... and you'd only get 40 hours of audio generation for that. It wasn't really sufficient to our needs.

All of these AI as a Service (AaaS?) API companies are going to race each other to razor thin margins. Immediately after ElevenLabs raised, five other TTS services raised nearly the same amount of money.

>How many audio books is 40 hours?

Are you reading War & Peace or Cat In The Hat?

I always assume 200 to 250 pages per book when someone talks about large quantities of books.
That's fairly short. I read about 100 books a year and it includes thousand page tomes like The Count of Monte Cristo.
I always assumed that book to be rather short since it just needs to be a number of sandwiches eaten.

100 books/year. That's an impressive feat regardless of the number of pages. Are these downloaded ebooks or physical printed copies of books?

It's mostly audiobooks, I have some ePubs that don't have audiobooks anywhere, such as many Japanese light novel fan (or official) translations into English for example. I can get through them as I can understand audio faster than I can read text, as I play back at 3 to 5x speed.
I like to read with my eyes, not listen. I honestly have no idea how long an audio book is, hours-wise.

I've seen a few for download, and they're always like hundreds of meg, if not over a gig. And that's in mp3, where it should be compressed heavily.

In my Audible library, the shortest is the first Hitchhiker's Guide to the Galaxy at 5h51m. The longest is The Power Broker at 66h9m. Most of the books I have are in the 15-25 hour range, but I also have a lot of fantasy stuff that gets near 50 hours (Game of Thrones, Brandon Sanderson...).
Well, then we're talking $300 to have ElevenLabs do a single GoT book, but maybe as many as 8 books for HHGTG-style stuff.

That's just not good value. Was sort of my point.
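
To put rough numbers on that, here's a back-of-the-envelope sketch using only the figures quoted in this thread ($300 for 40 hours, plus the book lengths above). It assumes cost scales linearly with audio hours and ignores any monthly quota boundaries or rollover, which the real plan may not allow.

```python
PLAN_USD = 300.0    # monthly plan price quoted above
PLAN_HOURS = 40.0   # hours of generated audio included in that plan

# Book lengths in hours, taken from the Audible figures mentioned above.
BOOK_HOURS = {
    "Hitchhiker's Guide (5h51m)": 5 + 51 / 60,
    "Typical title (20h, middle of the 15-25h range)": 20.0,
    "Long fantasy (~50h)": 50.0,
    "The Power Broker (66h9m)": 66 + 9 / 60,
}


def cost_per_book(hours: float) -> float:
    """Cost in USD, assuming price is proportional to hours of audio."""
    return hours * PLAN_USD / PLAN_HOURS


def books_per_plan(hours: float) -> float:
    """How many books of this length fit into one month's quota."""
    return PLAN_HOURS / hours


if __name__ == "__main__":
    for title, hours in BOOK_HOURS.items():
        print(f"{title}: ${cost_per_book(hours):.2f} each, "
              f"{books_per_plan(hours):.1f} per plan")
```

Under those assumptions the plan works out to $7.50 per hour of audio, so a 50-hour fantasy book would need more than one month's quota (about $375), while a bit under 7 Hitchhiker's-length books fit in a single month.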

I've tried a few, not an expert, but I think Coqui's new XTTS models are decent performance and quality wise (just in terms of how the speech sounds, can't speak to the voice cloning fidelity as I don't care about that). Open source code but non-commercial license for the model. They also have a bunch of models with more permissive licenses that aren't as good.

I doubt they're better than Google's TTS though.

Bark seems pretty good

https://github.com/suno-ai/bark Demo at https://huggingface.co/spaces/suno/bark

In the couple samples I tried it was substantially better at picking up meaning compared to VALL-E-X

> What's the best open source text to speech?

I haven't re-evaluated OSS TTS options for a few months but from my own experience earlier in the year I've been pleased with the results I've gotten from Piper:

* https://github.com/rhasspy/piper

I've primarily used it with the LibriTTS-based voices due to their license but if it's for personal local use you can probably use some of the other even higher quality voices.

The official samples are here: https://rhasspy.github.io/piper-samples/

Here's a small number of pre-rendered samples I've used that were generated from a WIP Piper port of my Dialogue Tool[0] project: https://rancidbacon.gitlab.io/piper-tts-demos/

While it's not perfect & output quality varies for a number of reasons, I've been using it because it's MIT licensed & there's multiple diverse voice options with licenses that suit my purposes.

(Piper and its predecessors Larynx & Mimic3 are significantly ahead of where other FLOSS options had been up until their existence in terms of quality.)

[0] https://rancidbacon.itch.io/dialogue-tool-for-larynx-text-to...

----

Edit to add links to some of my notes related to FLOSS TTS, in case they're of interest:

* https://gitlab.com/RancidBacon/notes_public/-/blob/main/note...

* https://gitlab.com/RancidBacon/notes_public/-/blob/main/note...

* https://gitlab.com/RancidBacon/notes_public/-/blob/main/note...

Would also like to know this. Can't seem to find an open source tts engine that works on mobile to read muh books