by kastnerkyle 2762 days ago
I first saw this type of concatenation in Alex Graves' work on "Generating Sequences With Recurrent Neural Networks" [0], including his unpublished TTS demo [1]. Biasing with part of another sentence (as in handwriting) might improve style in TTS as well.

We followed this approach in char2wav [2], but in my opinion "voice cloning" has come much further [3][4][5]. For anyone interested in the research side of this technology, there is a lot of relevant work on techniques that go beyond concatenating indicators or embeddings.

[0] https://arxiv.org/abs/1308.0850

[1] https://www.youtube.com/watch?v=-yX1SYeDHbg&t=38m30s

[2] http://josesotelo.com/speechsynthesis/

[3] https://twitter.com/Jeanne_Heo/status/972089715225542657 (lyrebird.ai)

[4] https://google.github.io/tacotron/publications/gmvae_control...

[5] https://google.github.io/tacotron/publications/speaker_adapt...