I would like to note once again that aeneas is not based on automatic speech recognition (ASR) techniques, but on MFCC + DTW (Mel-frequency cepstral coefficients plus dynamic time warping), which is an even older approach, with its own pros and cons.
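For readers who have not met DTW before, here is a minimal sketch of the textbook algorithm in plain Python. It is only an illustration, not aeneas's actual code: the function name `dtw_path` is made up for this example, and the toy sequences of single numbers stand in for what would really be sequences of MFCC vectors.

```python
def dtw_path(a, b, dist=lambda x, y: abs(x - y)):
    """Textbook dynamic time warping between two sequences.

    Illustrative sketch only (hypothetical helper, not aeneas's API).
    Returns the total alignment cost and the list of aligned index pairs.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest way to align a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j - 1],  # advance in both sequences
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
            )
    # Walk back from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# Toy stand-ins for feature frames: the second sequence lingers on "2".
total, path = dtw_path([1, 2, 3], [1, 2, 2, 3])
print(total, path)
```

In the toy call, the repeated `2` in the second sequence is simply mapped onto the single `2` in the first, which is the stretching behavior that gives DTW its name. The full cost table makes this version quadratic in time and memory, so it is only suitable for short inputs.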

Interestingly, there are situations where ASR-based forced aligners seem to be thrown off, while aeneas handles them more robustly: for example, when the speaker repeats a word in the audio but the transcript contains it only once, or when the speaker utters fillers (uhms, ahs, etc.). On the other hand, it is true that if you want word- or phoneme-level alignment, ASR-based aligners outperform aeneas.

Finally, let me note that three major goals of aeneas are: 1. processing hours of audio relatively fast on a standard PC (the current real-time factor is between 0.008 and 0.020); 2. being easy to install and run (unlike many other open source aligners derived from academic projects, which require a PhD just to get the dependencies right); and 3. working out of the box for many languages, including ones that are not covered by academic or commercial solutions because they are "minor" (say, Icelandic or ancient Greek (!)).
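To put the real-time factor in perspective, here is a small back-of-the-envelope calculation. It assumes the usual definition of real-time factor (processing time divided by audio duration); the helper name `processing_seconds` is made up for this example.

```python
def processing_seconds(audio_seconds, rtf):
    """Estimated processing time, assuming rtf = processing time / audio duration."""
    return audio_seconds * rtf


one_hour = 3600  # seconds of audio
for rtf in (0.008, 0.020):
    est = processing_seconds(one_hour, rtf)
    print(f"RTF {rtf}: about {est:.0f} s per hour of audio")
```

Under that assumption, one hour of audio takes roughly half a minute to a bit over a minute to align.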

But yes, the core algorithmic approach of aeneas has been around since the 1970s.