Hacker News
by jpcl 886 days ago
Both Polish and English samples are actually synthesized with a voice trained on the WolneLektury audiobooks. They are the highest quality open source (CC BY-SA) audiobooks I could find.

By using the Whisper-derived phonetic representation (the so-called semantic tokens), we successfully trained a model on a high-quality speech dataset in just one language, and the voice quality transferred to English.

2 comments

How much compute does it take to train from scratch? I'm asking because I have a lot of audiobooks. They're not necessarily CC-licensed, but for private use and training I think it'd be fine.
Training the T2S model from scratch takes around 8h on 96 A100 GPUs. Training the `tiny` S2A model is around 3x faster (training the HQ `small` variant is comparable to T2S).

I think you would get good results with fine-tuning, but unfortunately we don't have a user-friendly notebook or script for that right now. The biggest model is 800MB (FP32), so you won't even need a very big GPU to fine-tune it.
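
For anyone who wants to sanity-check those numbers, here is a rough back-of-the-envelope sketch. It is not from the project itself. The assumptions are mine: "3x faster" is taken to mean the same 96 GPUs for a third of the time, 800MB is read as 800 * 10^6 bytes, FP32 means 4 bytes per parameter, and fine-tuning is assumed to use an Adam-style optimizer that keeps gradients plus two moment buffers per weight. Activation memory is ignored, so the real footprint will be higher.

```python
# Back-of-the-envelope estimates based on the figures quoted in this thread.
# All helper names here are hypothetical and exist only for illustration.

def gpu_hours(wall_clock_hours: float, num_gpus: int) -> float:
    """Total GPU-hours for a run of the given length on num_gpus devices."""
    return wall_clock_hours * num_gpus

def params_from_checkpoint(size_bytes: float, bytes_per_param: int = 4) -> float:
    """Approximate parameter count from checkpoint size (FP32 = 4 bytes each)."""
    return size_bytes / bytes_per_param

def adam_state_bytes(num_params: float, bytes_per_param: int = 4) -> float:
    """Weights + gradients + two Adam moment buffers, all in FP32 (assumption)."""
    return num_params * bytes_per_param * 4

# T2S from scratch: around 8h on 96 A100 GPUs (from the comment above).
t2s_gpu_hours = gpu_hours(8, 96)

# `tiny` S2A: around 3x faster, assuming the same GPU count (my assumption).
tiny_s2a_gpu_hours = t2s_gpu_hours / 3

# Biggest model: 800MB in FP32, read here as 800 * 10^6 bytes (my assumption).
n_params = params_from_checkpoint(800e6)

# Memory for weights, gradients, and optimizer state, excluding activations.
finetune_state_gb = adam_state_bytes(n_params) / 1e9

print(f"T2S from scratch:      ~{t2s_gpu_hours:.0f} A100 GPU-hours")
print(f"tiny S2A from scratch: ~{tiny_s2a_gpu_hours:.0f} A100 GPU-hours")
print(f"Largest model:         ~{n_params / 1e6:.0f}M parameters")
print(f"Fine-tuning state:     ~{finetune_state_gb:.1f} GB before activations")
```

Under those assumptions the optimizer state and weights together stay in the low single-digit gigabytes, which is consistent with the point that fine-tuning should not need a very big GPU.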

Link to these in English? I found some hits that may be correct for Polish - but I'm guessing they're hosted somewhere canonical?
https://wolnelektury.pl/katalog/audiobooki/ is the Polish audiobook collection.

The English audiobooks are public domain recordings from LibriVox (via the LibriLight dataset).

Thank you. Is the Polish collection also a volunteer effort?

Link to LibriVox for others: https://librivox.org/

Not really, the Polish effort is run by a non-profit, which hired professional voice actors.