by jdietrich
3091 days ago
I don't think it's an insurmountable challenge. Tacotron 2 was trained on 24 hours of single-speaker transcribed audio, which is comparable in size to the freely available LJ Speech Dataset. We know that it's feasible to train using unaligned transcribed speech, which broadens the opportunities to reuse an existing corpus. (Aside: does anyone know the legal status of training a DL model on copyrighted content?) Tacotron 2 was trained on a 32-GPU cluster, which is large but not absurdly so; we are seeing drastic performance increases in low-precision compute, which should hopefully start to trickle down as Volta reaches the market. Expertise is a bigger challenge at this stage, although that is progressively changing. A huge number of developers are taking a serious interest in deep learning, so hopefully we'll start to see more active contributors to DL FOSS projects. The teams at DeepMind and Baidu Research are clearly highly skilled but relatively small, which suggests that their efforts could be replicated by a small but determined team of FOSS developers.