|
|
|
|
|
by gliese1337
2299 days ago
|
|
The way I'd do it is to use an existing speech recognition system with a large number of language models available (like CMU Sphinx--but probably not CMU Sphinx, 'cause I don't think there are decent openly-available models for 108 different language for Sphinx; maybe MicroSoft's Azure speech to text API or IBM's Watson speech recognition or something like that) to produce a rough transcript with timecodes, and then meet in the middle--use the timecodes from speech recognition, and the known-good text from whatever lyrics you already found, and reduce it to a text-to-text alignment problem so you can match up the ASR timecodes to the known-good text. First pass, I'd probably try an LCS match on the two text streams, but if that wasn't good enough, I'm sure there are better algorithms in the bioinformatics literature. |
|