by alpe 3382 days ago
> Might also be possible to look at the spectrum at any time to possibly identify areas of the file to skip.

I would say yes and no.

Currently you can pass a switch that makes aeneas ignore the audio intervals detected as "non speech" by the built-in Voice Activity Detector (VAD), which is a very rough, energy-based one. This part can certainly use some improvement.
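
To give an idea of what "energy-based" means here, below is a minimal sketch of such a detector. It is not the aeneas implementation: the function names, the 40 ms frame length and the 30 dB margin are all assumptions picked for illustration. A frame is labelled as speech if its log energy is within a fixed margin of the loudest frame.

```python
import math


def frame_energies_db(samples, frame_len):
    """Return the log energy (in dB) of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_sq = sum(s * s for s in frame) / frame_len
        # The small constant avoids log10(0) on digital silence.
        energies.append(10.0 * math.log10(mean_sq + 1e-12))
    return energies


def energy_vad(samples, sample_rate=16000, frame_ms=40, margin_db=30.0):
    """Label each frame True (speech) or False (non-speech).

    Hypothetical rule: a frame counts as speech if its energy is
    within margin_db of the loudest frame in the file.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    energies = frame_energies_db(samples, frame_len)
    if not energies:
        return []
    threshold = max(energies) - margin_db
    return [e >= threshold for e in energies]


def nonspeech_intervals(labels, frame_ms=40):
    """Merge consecutive non-speech frames into (start_s, end_s) pairs."""
    intervals = []
    start = None
    for i, is_speech in enumerate(labels):
        if not is_speech and start is None:
            start = i
        elif is_speech and start is not None:
            intervals.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None
    if start is not None:
        intervals.append((start * frame_ms / 1000, len(labels) * frame_ms / 1000))
    return intervals
```

A rule this simple cannot tell speech from music or loud noise, which is exactly why it is "very rough": anything energetic passes as speech.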

However, AFAIK, music/singing separation (for example) is a really difficult open problem, with people in academia doing PhDs on it. So I am not sure how far one can push this line while staying relatively fast on a regular machine, which is one of the goals of aeneas.

> And speaking about spectrum, just wondering, are you doing any pre-processing in terms of EQ (narrow band-pass on spoken frequencies), compression to not deal with volume, etc. to help with this also?

Besides converting the input audio file to mono, 16 kHz, 16-bit WAVE, I do not perform any other operation on the audio data before passing it to the MFCC extractor (which by default runs with "standard" settings, but the user can change them).
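
For anyone who wants to reproduce that conversion step outside aeneas, here is a small sketch that builds an ffmpeg command line producing a mono, 16 kHz, 16-bit PCM WAVE file. Using ffmpeg for this is an assumption of the sketch, as are the function names; it is not necessarily the exact invocation aeneas uses internally.

```python
import subprocess


def build_ffmpeg_command(src, dst):
    """Build an ffmpeg argument list for mono, 16 kHz, 16-bit PCM output.

    Assumes an ffmpeg binary is available on PATH; dst should end in .wav
    so that ffmpeg picks the WAVE container.
    """
    return [
        "ffmpeg",
        "-i", src,            # input file, any format ffmpeg can decode
        "-ac", "1",           # downmix to one channel (mono)
        "-ar", "16000",       # resample to 16 kHz
        "-c:a", "pcm_s16le",  # signed 16-bit little-endian PCM
        dst,
    ]


def convert_to_mono_wave(src, dst):
    """Run the conversion, raising CalledProcessError if ffmpeg fails."""
    subprocess.run(build_ffmpeg_command(src, dst), check=True)
```

Calling `convert_to_mono_wave("input.mp3", "input.wav")` would then write the converted file, with both file names being placeholders.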

Unfortunately, I have had no time to perform an exhaustive search of the parameter space, nor to try other pre-processing techniques.

But if you have a means to "pre-clean" the audio file before feeding it into aeneas, that will probably improve the quality of the output alignment.

(I did play with amplitude normalization and it did not seem to improve the results. The non-speech masking mentioned above seems beneficial if you do word-level alignment.)
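
For reference, one simple form of amplitude normalization is peak normalization, sketched below on a list of 16-bit sample values. This is only an illustration of the general technique: the function name and the 0.9 target are assumptions, and it is not necessarily the variant that was tried with aeneas.

```python
def peak_normalize(samples, target_peak=0.9, full_scale=32767):
    """Scale 16-bit samples so the loudest one reaches target_peak of full scale."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        # Digital silence (or empty input): nothing to scale.
        return list(samples)
    gain = target_peak * full_scale / peak
    # Round to integers and clamp to the signed 16-bit range.
    return [
        max(-full_scale - 1, min(full_scale, round(s * gain)))
        for s in samples
    ]
```

Since this applies one constant gain to the whole file, it changes the overall level but not the relative loudness of different passages.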