| HN Mirror



	by tymekpavel 3222 days ago
	I've always wondered why companies don't just take all the closed-captioned TV streams and use them as training data for their voice models. It seems like that would produce a much more natural-sounding voice model (at least by the standard humans are accustomed to).

3 comments

SiVal 3222 days ago

I think a better data source would be professionally produced audiobooks. Better, clearer voices, nearly perfect "transcription", enormous supply already recorded on a wide range of topics potentially available from one or two sources, etc.


eginhard 3222 days ago

Even with no background noise present, the audio quality is nowhere near that of professional studio recordings, and the difference would be very noticeable in the output.

Also, traditional systems need a lot of data from a single speaker; they can't take advantage of other speakers' recordings (although WaveNet does that now).

And "TV natural" might not be the style of natural you want from a TTS system.


epmaybe 3222 days ago

What happens if the CC isn't entirely in sync with the video or audio?


gwern 3222 days ago

Strictly speaking, the text is never 'entirely in sync' because spoken words inherently blur together and are seamless; individual letters in the text do not start and end at precise intervals. This is one of the things that makes speech recognition so hard: letters, syllables, and words do not really exist as discrete things on the raw audio level. So this problem exists for any speech transcription dataset. To provide a loss function, then, you would use something like CTC: http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=FD2... Fortunately, NNs are good at handling noisy data, and in practice they work very well for speech recognition/transcription.
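
To make the CTC idea concrete, here is a minimal pure-Python sketch (not taken from the linked paper, and not how any production toolkit implements it). It assumes a blank symbol written as "_" and per-frame probabilities supplied as plain dicts; the names `collapse`, `ctc_probability`, and `ctc_loss` are made up for illustration. The point is that the loss sums over every frame-level path that collapses to the transcript, so no per-letter timing is needed.

```python
import math

BLANK = "_"  # assumed blank symbol, chosen for this sketch


def collapse(path):
    """CTC collapsing rule: merge repeated symbols, then drop blanks."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)


def ctc_probability(frame_probs, target):
    """Total probability of all frame-level paths that collapse to `target`.

    frame_probs: one dict per audio frame, mapping symbol -> probability.
    Uses the standard forward recursion over the blank-extended target.
    """
    # Extended target with blanks around every symbol: _ c _ a _ t _
    ext = [BLANK]
    for ch in target:
        ext += [ch, BLANK]
    size = len(ext)

    # A path may start on the leading blank or on the first symbol.
    alpha = [0.0] * size
    alpha[0] = frame_probs[0][BLANK]
    if size > 1:
        alpha[1] = frame_probs[0][ext[1]]

    for t in range(1, len(frame_probs)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]                      # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]             # advance by one position
            # Skipping a blank is allowed only between two different symbols.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * frame_probs[t][ext[s]]
        alpha = new

    # A path may end on the last symbol or on the trailing blank.
    return alpha[-1] + (alpha[-2] if size > 1 else 0.0)


def ctc_loss(frame_probs, target):
    """Negative log-likelihood of the target under the frame probabilities."""
    return -math.log(ctc_probability(frame_probs, target))
```

For example, `collapse("__hh_e_ll_lo")` gives `"hello"`: the two l's survive only because a blank separates them. A real system would work in log space and on batched tensors; this version is just meant to show what is being summed.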


eginhard 3222 days ago

Recordings are force-aligned to the transcriptions anyway (using essentially a speech recognition system) to obtain phone-level alignments. You don't need explicit timing information beforehand.
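
As a rough illustration of what forced alignment does, here is a toy Viterbi sketch in pure Python. It assumes you already have per-frame log-probabilities for each phone from some acoustic model (here just dicts) and the known phone sequence from the transcript; `force_align` and `to_segments` are hypothetical names for this example, and real aligners use HMM states, silence models, and pronunciation lexicons on top of this.

```python
def force_align(log_probs, phones):
    """Assign each frame to one phone of a known sequence (toy Viterbi).

    log_probs: one dict per frame, mapping phone -> log-probability.
    phones: the phone sequence taken from the transcript.
    Returns a list with the index into `phones` chosen for each frame.
    The path is monotonic and every phone gets at least one frame.
    """
    n_frames, n_phones = len(log_probs), len(phones)
    if n_frames < n_phones:
        raise ValueError("need at least one frame per phone")

    neg_inf = float("-inf")
    score = [[neg_inf] * n_phones for _ in range(n_frames)]
    back = [[0] * n_phones for _ in range(n_frames)]
    score[0][0] = log_probs[0][phones[0]]  # must start on the first phone

    for t in range(1, n_frames):
        for n in range(n_phones):
            stay = score[t - 1][n]
            advance = score[t - 1][n - 1] if n > 0 else neg_inf
            if advance > stay:
                best, back[t][n] = advance, n - 1
            else:
                best, back[t][n] = stay, n
            if best > neg_inf:
                score[t][n] = best + log_probs[t][phones[n]]

    # Trace back from the last phone at the last frame.
    path = [n_phones - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


def to_segments(path, phones):
    """Turn a per-frame path into (phone, first_frame, last_frame) tuples."""
    segments = []
    for t, n in enumerate(path):
        if segments and segments[-1][0] == n:
            segments[-1][2] = t
        else:
            segments.append([n, t, t])
    return [(phones[n], start, end) for n, start, end in segments]
```

The only timing input is the frame rate of the acoustic features: the phone boundaries fall out of the best monotonic path, which is why caption timestamps are not needed up front.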
