Thank you so much for this link; that is the best text-to-speech with an open architecture I've heard so far. At https://github.com/keithito/tacotron you can find a pre-trained model based on this paper, although it doesn't match that quality yet. Maybe I can get some cluster time to train a new model using multiple datasets.