by alpe 3382 days ago
Thank you.

Indeed several users of aeneas adopted it for producing SRT/TTML files, i.e. captions, for videos, both online and offline --- and many of them start with an existing transcript.

However, please note that there are limitations on the amount of "non speech" that aeneas can tolerate: for example, long spurious portions of audio or sung passages might affect the quality of the alignment.

For details on how aeneas works: https://github.com/readbeyond/aeneas/blob/master/wiki/HOWITW...


> there are limitations on the amount of "non speech" that aeneas can tolerate

couldn't you have as part of the input also a very simple map where users could define times that should be ignored to help with that? Might also be possible to look at the spectrum at any time to possibly identify areas of the file to skip.

And speaking about spectrum, just wondering, are you doing any pre-processing in terms of EQ (narrow band-pass on spoken frequencies), compression to not deal with volume, etc. to help with this also?

> Might also be possible to look at the spectrum at any time to possibly identify areas of the file to skip.

I would say yes and no.

Currently you can add a switch that makes aeneas ignore the audio intervals that are detected as "non speech" by the built-in Voice Activity Detector (VAD), which is a very rough energy-based VAD. For sure this is a part that can use some improvement.
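To make "rough energy-based VAD" concrete, here is a minimal sketch of the general idea in plain Python. It is not the aeneas implementation: the function names, the 40 ms frame length, and the -35 dB threshold relative to the loudest frame are all assumptions picked for illustration.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def speech_intervals(samples, sample_rate=16000, frame_ms=40, threshold_db=-35.0):
    """Return (start_s, end_s) runs of frames louder than a relative threshold.

    The threshold is expressed in dB below the loudest frame (an assumption
    of this sketch, not a documented aeneas setting).
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    energies = frame_energies(samples, frame_len)
    if not energies or max(energies) == 0:
        return []
    floor = max(energies) * 10 ** (threshold_db / 10.0)

    intervals = []
    run_start = None
    for i, energy in enumerate(energies):
        loud = energy > floor
        if loud and run_start is None:
            run_start = i  # a loud run begins
        elif not loud and run_start is not None:
            intervals.append((run_start * frame_ms / 1000, i * frame_ms / 1000))
            run_start = None
    if run_start is not None:  # the file ends inside a loud run
        intervals.append((run_start * frame_ms / 1000, len(energies) * frame_ms / 1000))
    return intervals
```

Note that a detector like this only separates loud from quiet, so music or singing would still be labeled as "speech", which is exactly the limitation discussed here.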

However, AFAIK e.g. music/singing separation is a really difficult open problem, with people in academia doing PhDs on it. So, I am not sure how far one can push this line, while staying relatively fast on a regular machine. (Which is one of the goals of aeneas.)

> And speaking about spectrum, just wondering, are you doing any pre-processing in terms of EQ (narrow band-pass on spoken frequencies), compression to not deal with volume, etc. to help with this also?

Besides converting the input audio file to mono 16 kHz 16 bit WAVE, I do not perform any other operation on the audio data before passing it to the MFCC extractor (which by default runs with "standard" settings, but the user can change them).
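If you want to reproduce the "mono 16 kHz 16 bit WAVE" conversion yourself (e.g. to inspect or pre-clean the audio first), something like the following sketch should work, assuming ffmpeg is installed and on your PATH. The helper names are made up for this example, and this is not necessarily the exact command aeneas runs internally.

```python
import subprocess


def mono_16k_wav_args(src_path, dst_path):
    """Build an ffmpeg argument list for mono, 16 kHz, 16-bit PCM WAVE output."""
    return [
        "ffmpeg",
        "-y",                 # overwrite the output file if it exists
        "-i", src_path,       # input file (any format ffmpeg can decode)
        "-ac", "1",           # one audio channel (mono)
        "-ar", "16000",       # 16 kHz sample rate
        "-c:a", "pcm_s16le",  # signed 16-bit little-endian PCM
        dst_path,
    ]


def convert_to_mono_16k_wav(src_path, dst_path):
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(mono_16k_wav_args(src_path, dst_path), check=True)
```

Any extra filtering (band-pass, noise reduction, etc.) would go on the file produced here, before handing it to the aligner.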

Unfortunately, I have had no time to perform an exhaustive search of the parameter space, nor to try other pre-processing techniques.

But for sure if you have means to "pre-clean" the audio file before feeding it into aeneas, that is probably going to improve the quality of the output alignment.

(I did play with amplitude normalization and it did not seem to improve the results. The non-speech masking mentioned above seems beneficial if you do word-level alignment.)

Definitely.

Actually, aeneas can be used as a Python library (rather than just a CLI tool), and you can definitely provide an audio file, a list of audio intervals where the spoken text is, and align "piece-wise". See the "aeneas library tutorial" in the docs.
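To give a rough idea of what "piece-wise" means, here is a sketch of the bookkeeping only. It deliberately does not call aeneas: `align_fn` is a hypothetical callable standing in for whatever you use to align one interval (see the library tutorial for the real API), and `uniform_align` is a dummy that just splits the interval evenly so the example runs on its own.

```python
def align_piecewise(pieces, align_fn):
    """Align each (start_s, end_s, lines) piece and merge into one sync map.

    align_fn(start_s, end_s, lines) is a hypothetical aligner returning
    (begin, end, text) tuples with times relative to start_s.
    """
    sync_map = []
    for start_s, end_s, lines in pieces:
        if not lines:
            continue  # nothing spoken in this interval
        for rel_begin, rel_end, text in align_fn(start_s, end_s, lines):
            begin = start_s + rel_begin
            end = min(start_s + rel_end, end_s)  # never spill past the interval
            sync_map.append((begin, end, text))
    return sync_map


def uniform_align(start_s, end_s, lines):
    """Dummy aligner for demonstration: gives every line an equal share."""
    step = (end_s - start_s) / len(lines)
    return [(i * step, (i + 1) * step, line) for i, line in enumerate(lines)]


pieces = [
    (10.0, 14.0, ["first line", "second line"]),
    (30.0, 36.0, ["third", "fourth", "fifth"]),
]
for begin, end, text in align_piecewise(pieces, uniform_align):
    print(f"{begin:7.3f} {end:7.3f} {text}")
```

Everything outside the listed intervals (music, long pauses, etc.) is simply never shown to the aligner, so it cannot disturb the result.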

At the moment, the CLI tool aligns only a single audio interval (possibly chopping the head or the tail of the audio file) --- which is just a special case of the above.

I remember a user requested this feature in the past. I have not added it yet because:

1. I have not heard much interest in it, and I have not needed it myself;

2. I am not satisfied with the current CLI interface (historical reasons mandated the use of big config strings and strange, long parameter names), and hence I think that this kind of new feature should be added once aeneas 2.x is out, with a redesigned CLI.