by epmaybe 3176 days ago
What happens if the closed captions (CC) aren't entirely in sync with the video or audio?
2 comments

Strictly speaking, the text is never 'entirely in sync', because spoken words blur together seamlessly; the individual letters of the text do not start and end at precise points in the audio. This is one of the things that makes speech recognition so hard: letters, syllables, and words do not really exist as discrete units at the raw audio level. So this problem exists for any speech transcription dataset. To get a loss function without frame-level alignments, you would use something like CTC (connectionist temporal classification), which sums over all possible alignments of the transcript to the audio: http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=FD2... Fortunately, neural networks are good at handling noisy data, and in practice they work very well for speech recognition/transcription.
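To make the CTC idea concrete, here is a rough sketch of the forward pass that computes the loss for one utterance. It is plain Python for readability, not how any particular toolkit implements it; the function name `ctc_neg_log_likelihood`, the `probs[t][k]` layout, and the choice of index 0 for the blank symbol are all assumptions made for this illustration.

```python
import math

def ctc_neg_log_likelihood(probs, target, blank=0):
    """Negative log-likelihood of `target` under CTC.

    probs[t][k] is the model's probability of symbol k at frame t
    (hypothetical layout); target is a list of label ids without blanks.
    """
    # Interleave blanks around every label: [blank, y1, blank, y2, ..., blank]
    ext = [blank]
    for y in target:
        ext += [y, blank]
    S, T = len(ext), len(probs)

    # alpha[s] = total probability of all frame paths that end at ext[s]
    alpha = [0.0] * S
    alpha[0] = probs[0][ext[0]]
    if S > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]                # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]       # move on from the previous symbol
            # Skipping the blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    # Valid paths end on the last label or on the trailing blank.
    likelihood = alpha[S - 1] + (alpha[S - 2] if S > 1 else 0.0)
    if likelihood == 0.0:
        return float("inf")                 # too few frames for this target
    return -math.log(likelihood)

# Two frames, two symbols (0 = blank, 1 = "a"), uniform outputs.
frames = [[0.5, 0.5], [0.5, 0.5]]
print(ctc_neg_log_likelihood(frames, [1]))
```

Note that no timing information goes in: the loss only sees the per-frame outputs and the transcript, and every alignment consistent with the transcript contributes. A real implementation would work in log space to avoid underflow on long utterances.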
Recordings are force-aligned to the transcriptions anyway (essentially by running a speech recognition system constrained to the known text) to obtain phone-level alignments, so you don't need explicit timing information beforehand.
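For illustration, forced alignment can be sketched as a Viterbi search over a known phone sequence. This is a toy version: real aligners use HMM states, pronunciation lexicons, and optional silence, none of which appear here. The name `force_align` and the `log_probs[t][p]` layout (frame-level log scores per phone id) are assumptions for the example.

```python
import math

def force_align(log_probs, phones):
    """Best monotonic assignment of frames to a known phone sequence.

    log_probs[t][p] is a frame-level log score for phone id p (hypothetical
    layout). Returns a list of (phone, start_frame, end_frame_exclusive).
    """
    T, N = len(log_probs), len(phones)
    if N == 0 or T < N:
        raise ValueError("need at least one frame per phone")
    NEG = float("-inf")

    # score[t][i]: best score with frame t assigned to the i-th phone
    score = [[NEG] * N for _ in range(T)]
    # advanced[t][i]: True if the i-th phone starts exactly at frame t
    advanced = [[False] * N for _ in range(T)]
    score[0][0] = log_probs[0][phones[0]]

    for t in range(1, T):
        for i in range(N):
            stay = score[t - 1][i]
            advance = score[t - 1][i - 1] if i > 0 else NEG
            best = max(stay, advance)
            if best == NEG:
                continue                    # the i-th phone is unreachable at frame t
            advanced[t][i] = advance > stay
            score[t][i] = best + log_probs[t][phones[i]]

    # Backtrace from the last frame of the last phone to recover boundaries.
    i = N - 1
    bounds = [T]
    for t in range(T - 1, 0, -1):
        if advanced[t][i]:
            bounds.append(t)
            i -= 1
    bounds.append(0)
    bounds.reverse()
    return [(phones[k], bounds[k], bounds[k + 1]) for k in range(N)]

# Four frames; phone 0 is likely early on, phone 1 later.
frame_probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
frame_log_probs = [[math.log(p) for p in row] for row in frame_probs]
print(force_align(frame_log_probs, [0, 1]))
```

The only inputs are the acoustic scores and the transcript in phone form; the timings come out of the search, which is why captions that are a bit out of sync are not a blocker.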