Hacker News
by jdietrich 3091 days ago
I don't think it's an insurmountable challenge.

Tacotron 2 was trained on 24 hours of single-speaker transcribed audio, which is comparable in size to the freely available LJ Speech Dataset. We know that it's feasible to train on unaligned transcribed speech, which broadens the opportunities to reuse an existing corpus. (Aside: does anyone know the legal status of training a DL model on copyrighted content?) Tacotron 2 was trained on a 32-GPU cluster, which is large but not absurdly so; we are seeing drastic performance increases in low-precision compute, which should hopefully start to trickle down as Volta reaches the market.

Expertise is a bigger challenge at this stage, although that is progressively changing. A huge number of developers are taking a serious interest in deep learning, so hopefully we'll start to see more active contributors to DL FOSS projects. The teams at DeepMind and Baidu Research are clearly highly skilled but relatively small, which suggests that their efforts could be replicated by a small but determined team of FOSS developers.

2 comments

It's not just about the amount of data, though. The speech and audio quality of Google's TTS data is likely better than the audio contained in the LJ dataset (disclaimer: I've only listened to samples contained in the latter, which have some audible reverberation). Ideally, you'd use a professionally trained speaker and record them in a (semi-)anechoic chamber.

> Aside: does anyone know the legal status of training a DL model on copyrighted content?

How would anyone know that it happened?

Not a legal opinion:

It seems like it would fit under Fair Use in the USA but wouldn't be allowed under the UK's Fair Dealing; I can't comment on other jurisdictions.