Hacker News
by psobot 3383 days ago
This is super cool. I'm trying to think of common practical applications for this - would one use this to sync a script with a performance? Could this remove a lot of the work required to manually subtitle movies, TV shows, and YouTube videos?
4 comments

Thank you.

Indeed several users of aeneas adopted it for producing SRT/TTML files, i.e. captions, for videos, both online and offline --- and many of them start with an existing transcript.
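
For readers who have not worked with SRT before, here is a minimal sketch of turning alignment output into caption blocks. The `(begin, end, text)` tuples, the sample text, and both function names are hypothetical and made up for illustration. This is not aeneas's own output code, only the usual SRT layout: an index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` line, the text, and a blank line.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def fragments_to_srt(fragments):
    """Render (begin, end, text) tuples as the body of an SRT file."""
    blocks = []
    for index, (begin, end, text) in enumerate(fragments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(begin)} --> {srt_timestamp(end)}\n{text}\n"
        )
    # Each block ends with a newline, so joining with "\n" leaves one blank
    # line between consecutive captions.
    return "\n".join(blocks)


# Hypothetical alignment output: (begin seconds, end seconds, caption text).
fragments = [
    (0.0, 2.64, "First sentence of the transcript."),
    (2.64, 5.88, "Second sentence of the transcript."),
]
print(fragments_to_srt(fragments))
```

Keeping the timestamp formatting in its own function makes it easy to check in isolation, since rounding to milliseconds is where caption writers most often go wrong.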

However, please note that there are limitations on the amount of "non speech" that aeneas can tolerate: for example, long spurious portions of audio or sung passages might affect the quality of the alignment.

For details on how aeneas works: https://github.com/readbeyond/aeneas/blob/master/wiki/HOWITW...

> there are limitations on the amount of "non speech" that aeneas can tolerate

Couldn't you also take, as part of the input, a very simple map where users define times that should be ignored, to help with that? Might also be possible to look at the spectrum at any time to possibly identify areas of the file to skip.

And speaking about spectrum, just wondering, are you doing any pre-processing in terms of EQ (narrow band-pass on spoken frequencies), compression to not deal with volume, etc. to help with this also?

> Might also be possible to look at the spectrum at any time to possibly identify areas of the file to skip.

I would say yes and no.

Currently you can add a switch that makes aeneas ignore the audio intervals that are detected as "non speech" by the built-in Voice Activity Detector (VAD), which is a very rough energy-based VAD. For sure this is a part that can use some improvement.
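
To make "rough energy-based VAD" concrete, here is a minimal sketch in plain Python. It is not the aeneas implementation: the frame length, the 10%-of-peak threshold, and the function names are all assumptions chosen for illustration.

```python
import math


def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def speech_mask(energies, threshold_ratio=0.1):
    """Mark a frame as speech when its energy exceeds a fraction of the peak."""
    if not energies:
        return []
    threshold = threshold_ratio * max(energies)
    return [energy > threshold for energy in energies]


# Synthetic signal: 4 quiet frames, 4 loud frames, 4 quiet frames.
frame_len = 160  # 10 ms at 16 kHz (assumed frame size)
quiet = [0.001 * math.sin(0.3 * n) for n in range(4 * frame_len)]
loud = [0.5 * math.sin(0.3 * n) for n in range(4 * frame_len)]
samples = quiet + loud + quiet

mask = speech_mask(frame_energies(samples, frame_len))
print(mask)
```

A detector like this cannot tell speech from music or singing, since both carry plenty of energy, which is the limitation discussed in this comment.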

However, AFAIK e.g. music/singing separation is a really difficult open problem, with people in academia doing PhDs on it. So, I am not sure how far one can push this line, while staying relatively fast on a regular machine. (Which is one of the goals of aeneas.)

> And speaking about spectrum, just wondering, are you doing any pre-processing in terms of EQ (narrow band-pass on spoken frequencies), compression to not deal with volume, etc. to help with this also?

Besides converting the input audio file to mono 16 kHz 16 bit WAVE, I do not perform any other operation on the audio data before passing it to the MFCC extractor (which by default runs with "standard" settings, but the user can change them).

Unfortunately, I have had no time to perform an exhaustive search of the parameter space, nor to try other pre-processing techniques.

But for sure if you have means to "pre-clean" the audio file before feeding it into aeneas, that is probably going to improve the quality of the output alignment.

(I did play with amplitude normalization and it did not seem to improve the results. The non-speech masking mentioned above seems beneficial if you do word-level alignment.)

Definitely.

Actually, aeneas can be used as a Python library (rather than just a CLI tool), and you can definitely provide an audio file, a list of audio intervals where the spoken text is, and align "piece-wise". See the "aeneas library tutorial" in the docs.

At the moment, the CLI tool aligns only a single audio interval (possibly chopping the head or the tail of the audio file) --- which is just a special case of the above case.

I remember a user requested this feature in the past. I have not added it yet because:

1. I have not heard much interest in it, and I have not needed it myself;

2. I am not satisfied with the current CLI interface (historical reasons mandated the use of big config strings and strange, long parameter names), and hence I think that this kind of new feature should be added once aeneas 2.x is out, with a redesigned CLI.
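
The "ignore map" idea from the question can be sketched independently of aeneas. Assuming the user supplies a list of `(begin, end)` intervals in seconds to skip, the snippet below computes the complementary intervals that one would then align piece-wise. The function name and the input format are hypothetical, not part of the aeneas API.

```python
def intervals_to_align(duration, ignored):
    """Return the parts of [0, duration] not covered by any ignored interval."""
    kept = []
    cursor = 0.0
    for begin, end in sorted(ignored):
        # Clamp each ignored interval to the audio bounds.
        begin = min(max(begin, 0.0), duration)
        end = min(max(end, 0.0), duration)
        if begin > cursor:
            kept.append((cursor, begin))
        # max() handles overlapping or nested ignored intervals.
        cursor = max(cursor, end)
    if cursor < duration:
        kept.append((cursor, duration))
    return kept


# Hypothetical 10-minute file: skip an intro jingle and two overlapping
# musical passages in the middle.
ignored = [(0.0, 12.5), (300.0, 345.0), (330.0, 360.0)]
print(intervals_to_align(600.0, ignored))
```

Each kept interval would still need to be paired with the portion of the text spoken in it, which is the part the user has to provide.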

When I was an undergrad freshman, I took a job with a research group as a data annotator. My job was to go through the Switchboard corpus (recordings of hour-long phone calls that people agreed to have recorded, in exchange for having the long-distance charges paid) and label features such as who was speaking, whether the pitch of the voice was rising or falling, whether the vowels were elongated, vocal fry, and stuff like that.

But the most time-consuming and mind-numbing part of it was just annotating the words in the sound file.

The interface for all of this was a terrible GUI hacked in on top of some Solaris sound editor, and it couldn't do things for you like find the moments that words began, or say "hey the pitch is obviously falling here" because frequency tracking is a thing computers can do, or anything.

There's still a lot more voice data to annotate in the world, and maybe having a flexible Python tool like this will make the next undergrad doing the grunt work much more effective at it.

I agree on most of your observations.

However, please note that other tools are better suited than aeneas if one wants to align at phoneme level: gentle, Kaldi, SPPAS, etc.

aeneas' goals are to cover as many languages as possible, to compute fast, and to target (sub)sentence granularity (e.g., ebook-audiobook or closed captions). Phoneme-level annotation really requires more sophisticated techniques, like HMM/GMM/NN as implemented by the tools mentioned above. Yet, aeneas can be used to quickly bootstrap e.g. a manually-reviewed alignment.

There's nothing new about this, it's how speech recognition training data has been generated for a long time. Whether you can align a script will depend on how accurate it is and how expressive your models are for generating spoken/surface form alternatives for the ways things like dates are verbalized, which look different in text. If more than one person is speaking at the same time the results will be terrible.
I would like to note once again that aeneas is not based on automatic speech recognition techniques, but on MFCC + DTW, which is an even older approach, with pros and cons.

Interestingly, there are situations where ASR-based forced aligners seem to be tricked into error, while aeneas handles them more robustly: for example, if the speaker repeats a word in the spoken audio but the transcript has only one occurrence, or when the speaker mumbles (uhm's, ah's, etc.). On the other hand, it is true that if you want word- or phoneme-level alignment, ASR-based aligners outperform aeneas.

Finally, let me note that three major goals of aeneas are: 1. to process hours of audio relatively fast on a standard PC (the current real time factor is between 0.008 and 0.020); 2. to be easy to install and run (unlike many other open source aligners derived from academic projects, which require a PhD just to get the dependencies right); and 3. to work out-of-the-box for many languages, including ones that are not covered by academia or commercial solutions because they are "minor" (say, Icelandic or ancient Greek (!)).
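
As a rough worked example of what that real time factor means, take RTF as processing time divided by audio duration (the usual definition, assumed here). The 10-hour duration is just an illustrative input, and the function name is made up.

```python
def processing_seconds(audio_hours, real_time_factor):
    """Estimated processing time, with RTF = processing time / audio duration."""
    return audio_hours * 3600 * real_time_factor


# The two RTF bounds quoted above, applied to a hypothetical 10-hour recording.
for rtf in (0.008, 0.020):
    seconds = processing_seconds(10, rtf)
    print(f"RTF {rtf:.3f}: about {seconds / 60:.1f} minutes")
```

Under that assumption, ten hours of audio would take roughly 5 to 12 minutes to align.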

But yes, the core algorithmic approach of aeneas has been around since the 1970s.
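
For readers unfamiliar with DTW, here is a minimal textbook sketch in plain Python. It is not the aeneas implementation (which, as discussed elsewhere in this thread, works on a reduced cost matrix within a margin). The one-dimensional "feature vectors" below are toy stand-ins for MFCC frames, and the function name is an assumption.

```python
import math


def dtw_path(a, b):
    """Align two sequences of feature vectors with classic DTW.

    Returns (total_cost, path), where path is a list of (i, j) index pairs
    matching a[i] to b[j].
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # acc[i][j] = minimal accumulated cost of aligning a[:i] with b[:j].
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return acc[n][m], path


# Toy example: b is a "slower" rendition of a, with two frames held longer.
a = [(0.0,), (1.0,), (2.0,), (3.0,)]
b = [(0.0,), (0.0,), (1.0,), (2.0,), (2.0,), (3.0,)]
cost, path = dtw_path(a, b)
print(cost, path)
```

The full accumulated matrix has `len(a) * len(b)` cells, which is why memory becomes the bottleneck for long recordings.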

I had a vague plan to start working on something like this recently with the idea that I could automatically take audiobook media files and their accompanying ebook representation and use it to automatically re-divide the file by chapter (or using something based on chapter). Not sure if this will work well for that (or if my use is considered "common"), but I'm certainly glad to see it.
I have used aeneas myself to do it, with mixed results. You will probably need to increase the DTW margin. Also note that you will need a lot of RAM --- say 16 GB if you plan to work on a single audio file with duration 10-15 hours, which is typical for an audiobook.

In theory one can perform the DTW out-of-core, saving the accumulated cost matrix and path to disk, but I have not had time to implement this yet (i.e., for now the accumulated, reduced DTW cost matrix must fit into RAM). I tested that it can be done with PyTables, but it will probably come with the next major version of aeneas (v2).

BTW, if your goal is to split, say, chapters of an audiobook, probably there are more efficient ways of doing this. For example, finding the long silence intervals between chapters might be enough. Or, instead of aligning all the text against all the audio, just perform a "partial matching" of the first sentences of each chapter against the audio.
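
The "long silence" heuristic can be sketched in a few lines. This is only an illustration under stated assumptions: the per-frame energies are presumed to be computed already, and the threshold, the minimum duration, and the function name are hypothetical values that would need tuning on real audiobooks.

```python
def long_silences(energies, frame_seconds, threshold, min_seconds):
    """Return (begin, end) times of low-energy runs lasting at least min_seconds."""
    runs = []
    start = None
    # The trailing sentinel closes a silent run that reaches the end of the audio.
    for index, energy in enumerate(list(energies) + [float("inf")]):
        if energy < threshold:
            if start is None:
                start = index
        elif start is not None:
            if (index - start) * frame_seconds >= min_seconds:
                runs.append((start * frame_seconds, index * frame_seconds))
            start = None
    return runs


# Toy energies at 0.5 s per frame: a 1 s pause (too short to count),
# then a 3 s pause that could mark a chapter break.
energies = [1.0] * 4 + [0.0] * 2 + [1.0] * 4 + [0.0] * 6 + [1.0] * 2
print(long_silences(energies, frame_seconds=0.5, threshold=0.1, min_seconds=2.0))
```

The candidate breaks found this way still have to be matched to actual chapters, for instance with the partial matching suggested above.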

Yes, thanks for the feedback! I wasn't planning on feeding it the entire audiobook and trying to align the whole thing (though there are other reasons you might want to do something like this) - I figured I'd use some heuristic methods to detect chapter breaks (like long silences), then try as you say partial matching to figure out which ones correspond to what chapters (or which ones correspond to chapters at all). Like I said, it was a vague plan, but when I've played around with running things through speech-to-text in the past I haven't had excellent results. I was hoping something like this (where you have the speech and the text and just want to know how they line up) would end up being much more accurate.
You are welcome.

Using a forced aligner usually improves the results a lot when compared to using an automatic speech recognition system --- because adapting the language model to your specific text prunes a lot of choices w.r.t. a generic language model which is supposed to cover any kind of text in that given language.

Anyway, if you feed aeneas an audio file < 2 hours, 4 GB of RAM should suffice, and the default parameters should be good as well. If you just need to recognize the splits, doing a full alignment is overkill, but I guess you will be happy to "waste" 5 minutes of computation time instead of spending more time implementing your own code.