by alpe 3382 days ago
aeneas is not based on ASR (i.e., it does not try to "recognize" words and align them with the input text), but on the "older" MFCC + DTW approach.

Hence, it is difficult to give you a precise answer, e.g. in terms of word-error-rate or similar metrics.

For the task aeneas has been designed for --- aligning an ebook and the corresponding audiobook --- and for similar tasks (e.g., captioning videos of lectures or spoken-only content), it generally produces an alignment that is indistinguishable from a manually produced one.

If you want to see some examples, read and listen to one of these audio-ebooks, whose alignment was produced by aeneas: https://www.readbeyond.it/ebooks.html

But of course, if you want to align at a finer level (words), or if the audio is noisier or does not closely match the text, the quality of the alignment can deteriorate.

1 comments

Thanks for the explanation. Will it work if there are gaps in the transcript? E.g., a clean verbatim transcript where the ah's and uhm's are left out.
Several users of aeneas interested in producing caption files for videos told me that it does. And considering how DTW works, it is plausible.
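
To illustrate why this is plausible, here is a minimal sketch of classic DTW in Python. It is a toy under stated assumptions, not aeneas's actual implementation: plain numbers stand in for MFCC frames, the local distance is an absolute difference, and `dtw_path` is a hypothetical helper name.

```python
def dtw_path(ref, query):
    """Align two 1-D feature sequences with classic DTW.

    Returns (total_cost, path), where path is a list of (ref_index, query_index)
    pairs. Toy illustration only: real systems compare multi-dimensional
    MFCC frames, not single numbers.
    """
    n, m = len(ref), len(query)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning ref[:i] with query[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(ref[i - 1] - query[j - 1])  # local distance between frames
            cost[i][j] = d + min(
                cost[i - 1][j - 1],  # advance in both sequences
                cost[i - 1][j],      # advance in ref only
                cost[i][j - 1],      # advance in query only
            )
    # Backtrack from the end to recover the cheapest warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# Hypothetical data: three "transcript" frames versus audio that holds the
# first and last sounds longer and contains one extra filler frame (6.0).
transcript = [1.0, 5.0, 9.0]
audio = [1.0, 1.0, 5.0, 6.0, 9.0, 9.0]
total, path = dtw_path(transcript, audio)
print(total)  # 1.0
print(path)   # [(0, 0), (0, 1), (1, 2), (1, 3), (2, 4), (2, 5)]
```

In this toy run the filler frame (audio index 3) is simply attached to the nearest transcript frame at a small cost, and the frames around it still line up. Whether real filler sounds behave this well depends on the audio, which is why this is only suggestive.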

Unfortunately, I have not had the time to set up a suitable corpus and perform a rigorous evaluation, so I cannot comfortably answer your question with a definitive "yes".

Perhaps the best way to see whether aeneas works for your use case is to try it out.

If you do not want to install anything on your machine, you can use the aeneas Web app: https://aeneasweb.org --- basically, you submit an audio file (or a YouTube URL) and a text file, and you get an SRT/TTML/etc. file emailed back.

I definitely plan to try it soon.