by eginhard 3182 days ago
Recordings are force-aligned to the transcriptions anyway (using essentially a speech recognition system) to obtain phone-level alignments. You don't need explicit timing information beforehand.
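To make the idea concrete, here is a toy sketch of forced alignment, not how any particular toolkit does it. It assumes you already have per-frame log-likelihoods from an acoustic model for each phone in the known transcript (the `frame_scores` matrix and the `force_align` name are made up for illustration). Since the phone order is fixed by the transcription, a simple dynamic program finds where each phone starts and ends.

```python
import math

def force_align(frame_scores):
    """Toy forced alignment via dynamic programming.

    frame_scores[t][p] is a hypothetical log-likelihood that audio frame t
    belongs to the p-th phone of the transcript (phones in transcript order).
    Returns a list of (phone_index, start_frame, end_frame_exclusive).
    """
    T = len(frame_scores)
    P = len(frame_scores[0])
    if T < P:
        raise ValueError("need at least one frame per phone")

    NEG = -math.inf
    best = [[NEG] * P for _ in range(T)]
    advanced = [[False] * P for _ in range(T)]  # True if we moved to a new phone at t
    best[0][0] = frame_scores[0][0]  # the path must start in the first phone

    for t in range(1, T):
        for p in range(P):
            stay = best[t - 1][p]                       # remain in the same phone
            move = best[t - 1][p - 1] if p > 0 else NEG  # step to the next phone
            if move > stay:
                best[t][p] = move + frame_scores[t][p]
                advanced[t][p] = True
            else:
                best[t][p] = stay + frame_scores[t][p]

    # Backtrace from the last frame of the last phone.
    path = [0] * T
    p = P - 1
    for t in range(T - 1, -1, -1):
        path[t] = p
        if advanced[t][p]:
            p -= 1

    # Collapse the frame-level path into phone segments.
    segments = []
    start = 0
    for t in range(1, T + 1):
        if t == T or path[t] != path[start]:
            segments.append((path[start], start, t))
            start = t
    return segments

# Made-up scores: frames 0-1 favour phone 0, frames 2-4 favour phone 1.
scores = [[0, -5], [0, -5], [-5, 0], [-5, 0], [-5, 0]]
print(force_align(scores))
```

The timing falls out of the search itself, which is why no timestamps are needed up front. Real systems do the same kind of search over HMM states with a trained acoustic model rather than a hand-written score table.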