aeneas is not based on ASR (i.e., it does not try to "recognize" words and align them with the input text), but on the "older" MFCC + DTW approach.
Hence, it is difficult to give you a precise answer, e.g. in terms of word-error-rate or similar metrics.
For the task aeneas has been designed for --- aligning an ebook and the corresponding audiobook --- and for similar tasks (e.g., captioning videos of lectures or spoken-only content), it generally produces an alignment that is indistinguishable from a manually-produced one.
If you want to see some examples, read and listen to one of these audio-ebooks, whose alignment has been produced by aeneas: https://www.readbeyond.it/ebooks.html
But of course, if you want to align at a finer level (word) or with noisier or non-matching audio, the quality of the alignment can deteriorate.
Several users of aeneas interested in producing caption files for videos told me that it does. And considering how DTW works, it is plausible.
Unfortunately, I have not had the time to set up a suitable corpus and perform a rigorous evaluation, so I cannot comfortably answer your question with a definitive "yes".
Perhaps the best way to see whether aeneas works for your use case is to try it out.
If you do not want to install anything on your machine, you can use the aeneas Web app: https://aeneasweb.org. Basically, you submit an audio file (or a YouTube URL) and a text file, and get an SRT/TTML/etc. file emailed back.
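
For readers unfamiliar with DTW, here is a minimal sketch of the idea in plain Python. It is not aeneas code: the function name `dtw_path` and the toy scalar "features" are made up for illustration, whereas a real aligner compares MFCC vectors extracted from audio. It only shows how DTW finds a monotonic frame-to-frame matching between two sequences that differ in timing.

```python
def dtw_path(a, b, dist=lambda x, y: abs(x - y)):
    """Return (total_cost, path) aligning sequence a to sequence b with classic DTW.

    path is a list of (index_in_a, index_in_b) pairs, in increasing order.
    Hypothetical helper for illustration only, not part of aeneas.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # advance in both
                                 cost[i - 1][j],      # advance in a only
                                 cost[i][j - 1])      # advance in b only
    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


# Toy usage: b is a "slower" rendition of a (some values are held longer).
reference = [0, 1, 2, 3]
stretched = [0, 0, 1, 2, 2, 3]
total, path = dtw_path(reference, stretched)
print(total, path)
```

Note that every element of both sequences must be matched to something, and the matching can never go backwards. With clean, matching input the path simply follows the timing differences; material present in only one of the two sequences still has to be absorbed somewhere along the path.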