by cityhall
3386 days ago
There's nothing new about this; it's how speech recognition training data has been generated for a long time. Whether you can align a script depends on how accurate the script is and on how expressive your models are at generating spoken (surface-form) alternatives for things like dates, which are verbalized differently from how they are written. If more than one person is speaking at the same time, the results will be terrible.
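To make the "surface form alternatives" point concrete, here is a minimal sketch of how a single written token can expand into several spoken forms that an aligner would have to accept. It only covers English four-digit years from 1000 to 2099, and `year_variants` and `two_digits` are hypothetical helper names for illustration, not part of any real toolkit.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def year_variants(year):
    """Return plausible spoken forms of a year between 1000 and 2099."""
    hi, lo = divmod(year, 100)
    variants = []
    if lo == 0:
        variants.append(f"{two_digits(hi)} hundred")
    elif lo < 10:
        variants.append(f"{two_digits(hi)} oh {ONES[lo]}")
        variants.append(f"{two_digits(hi)} hundred and {ONES[lo]}")
    else:
        variants.append(f"{two_digits(hi)} {two_digits(lo)}")
    if hi % 10 == 0:
        # Round millennia also have a "thousand" reading, e.g. 2005.
        thousands = ONES[hi // 10]
        if lo == 0:
            variants.append(f"{thousands} thousand")
        else:
            variants.append(f"{thousands} thousand {two_digits(lo)}")
            variants.append(f"{thousands} thousand and {two_digits(lo)}")
    return variants


print(year_variants(2005))
```

Even this toy case yields four readings for "2005"; full dates, currencies and abbreviations multiply the alternatives quickly, which is why the quality of this expansion step matters so much for alignment.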
|
Interestingly, there are situations where ASR-based forced aligners seem to be tricked into errors while aeneas handles them more robustly: for example, when the speaker repeats a word but the transcript has only one occurrence, or when the speaker uses fillers (uhm's, ah's, etc.). On the other hand, it is true that if you want word- or phoneme-level alignment, ASR-based aligners outperform aeneas.
Finally, let me note that the three major goals of aeneas are: 1. to process hours of audio relatively fast on a standard PC (the current real-time factor is between 0.008 and 0.020, i.e. roughly 29 to 72 seconds of processing per hour of audio); 2. to be easy to install and run (unlike many other open source aligners derived from academic projects, which require a PhD just to get the dependencies right); and 3. to work out of the box for many languages, including ones that are not covered by academic or commercial solutions because they are "minor" (say, Icelandic or ancient Greek (!)).
But yes, the core algorithmic approach of aeneas has been around since the 1970s.
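For readers who have not met that approach: as far as I know, aeneas synthesizes the text with a TTS engine and then aligns the synthetic and real audio with dynamic time warping (DTW) over MFCC features. Below is a minimal sketch of plain DTW on two 1-D sequences, assuming an absolute-difference cost. It is not the aeneas code (which works on MFCC vectors and restricts the search to a band), and `dtw_path` is a hypothetical name.

```python
def dtw_path(a, b):
    """Align two non-empty numeric sequences with classic DTW.

    Returns (total_cost, path), where path is a list of (i, j) index
    pairs matching a[i] to b[j].
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] repeats b[j-1]
                                 cost[i][j - 1],      # b[j-1] repeats a[i-1]
                                 cost[i - 1][j - 1])  # one-to-one match
    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


# The second sequence "repeats" one element, like a speaker repeating a word.
total, path = dtw_path([0, 1, 2, 3], [0, 1, 1, 2, 3])
print(total, path)
```

In this toy run the repeated element is simply mapped twice onto the same source element at zero extra cost, which gives some intuition for why a DTW-based aligner can absorb repetitions that the transcript does not contain.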