|
|
|
|
|
by gliese1337
2299 days ago
|
|
Alternately, since you say speech recognition isn't "even close", I might try going the other way--doing text-to-speech on the audio stream, attempting to align the two speech tracks, and the back-porting the timecodes from audio alignment onto the text. But that seems a lot more complicated... so, unlikely. A way to cheat that would probably work good enough most of the time would be to spectrographic analysis on the audio stream to identify syllables, and then similarly just count syllables in the known text and line those up. That works better the more consistent your spelling system is, though, and still requires language-specific modelling. If you actually want to do a decent job cross-linguistically, you'd need in the general case a dictionary for every supported language listing syllable counts for each word (because not everybody's orthography is transparent enough to make simple models like counting character sequences work). If you actually have a fully language-agnostic algorithm for aligning text to audio that's actually decently accurate, though, that's gotta be worth at least a Master's degree in computational linguistics, 'cause on the face of it it doesn't seem to me (who has such a Masters degree) that it should even theoretically be possible. |
|