by tymekpavel 3175 days ago
I've always wondered why companies don't just take all the closed-captioned TV streams and use them as training data for their voice models. Seems like it would create a much more natural-sounding voice model (at least as far as humans are accustomed to).
3 comments

I think a better data source would be professionally produced audiobooks. Better, clearer voices, nearly perfect "transcription", enormous supply already recorded on a wide range of topics potentially available from one or two sources, etc.
Even if there is no background noise present, the quality of TV audio is nowhere near that of professional studio recordings, and the difference would be very noticeable in the output.

Also, traditional systems need a lot of data from a single speaker; they can't take advantage of other speakers' recordings (although WaveNet does that now).

And "TV natural" might not be the style of natural you want from a TTS system.

What happens if the CC isn't entirely in sync with the video or audio?
Strictly speaking, the text is never 'entirely in sync' because spoken words inherently blur together and are seamless; individual letters in the text do not start and end at precise intervals. This is one of the things that makes speech recognition so hard: letters, syllables, and words do not really exist as discrete things at the raw audio level. So this problem exists for any speech transcription dataset. To provide a loss function, then, you would use something like CTC (connectionist temporal classification): http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=FD2... Fortunately, NNs are good at handling noisy data, and in practice they work very well for speech recognition/transcription.
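To make the CTC idea concrete, here is a minimal pure-Python sketch of the forward pass that sums the probability of every frame-level path collapsing to the target text. The names `BLANK`, `collapse`, and `ctc_neg_log_likelihood`, and the choice of index 0 for the blank symbol, are assumptions for illustration and are not taken from the linked paper or any particular library.

```python
import math

BLANK = 0  # assumed index of the CTC blank symbol in this sketch


def collapse(path):
    """Map a frame-level path to text: merge repeats, then drop blanks."""
    out, prev = [], None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return out


def ctc_neg_log_likelihood(probs, target):
    """probs[t][k] is the probability of symbol k at frame t.
    target is the list of non-blank symbol ids (the transcript).
    Returns -log of the total probability of all paths that collapse to target."""
    # Interleave blanks around the labels: [a, b] -> [_, a, _, b, _]
    ext = [BLANK]
    for sym in target:
        ext += [sym, BLANK]
    T, S = len(probs), len(ext)

    # alpha[t][s]: probability of all partial paths ending at ext[s] at frame t
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = probs[0][ext[0]]
    if S > 1:
        alpha[0][1] = probs[0][ext[1]]

    for t in range(1, T):
        for s in range(S):
            total = alpha[t - 1][s]              # stay on the same symbol
            if s >= 1:
                total += alpha[t - 1][s - 1]     # advance by one position
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[t - 1][s - 2]
            alpha[t][s] = total * probs[t][ext[s]]

    # Valid paths end on the last label or the trailing blank.
    likelihood = alpha[T - 1][S - 1]
    if S > 1:
        likelihood += alpha[T - 1][S - 2]
    return -math.log(likelihood) if likelihood > 0 else math.inf
```

No per-letter timing appears anywhere in the inputs: the loss only needs per-frame symbol probabilities and the transcript, which is why loosely synced captions are not a blocker in principle. A real system would work in log space for numerical stability.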
Recordings are force-aligned to the transcriptions anyway (using essentially a speech recognition system) to obtain phone-level alignments. You don't need explicit timing information beforehand.
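As a rough illustration of forced alignment, the sketch below runs a Viterbi search over a known phone sequence, assuming some acoustic model has already produced per-frame log-probabilities for each phone. The function name `force_align` and the input layout are hypothetical; real aligners use HMM states and more elaborate models, so treat this as a toy version of the idea.

```python
import math


def force_align(log_probs, phones):
    """log_probs[t][p] is the log-probability of phone id p at frame t.
    phones is the known phone sequence from the transcript.
    Returns, for each frame, the index into `phones` assigned to that frame."""
    T, N = len(log_probs), len(phones)
    if N == 0 or T < N:
        raise ValueError("need at least one frame per phone")

    NEG = -math.inf
    score = [[NEG] * N for _ in range(T)]   # best score ending at (frame, phone index)
    back = [[0] * N for _ in range(T)]      # backpointers for the traceback
    score[0][0] = log_probs[0][phones[0]]   # alignment must start on the first phone

    for t in range(1, T):
        for n in range(N):
            stay = score[t - 1][n]                       # remain in the same phone
            move = score[t - 1][n - 1] if n > 0 else NEG # advance to the next phone
            if move > stay:
                best, back[t][n] = move, n - 1
            else:
                best, back[t][n] = stay, n
            score[t][n] = best + log_probs[t][phones[n]]

    # Alignment must end on the last phone; follow backpointers to frame 0.
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path
```

The phone boundaries fall wherever the returned index changes, so the timing comes out of the search itself rather than from the captions.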