by x1798DE 3382 days ago
I had a vague plan to start working on something like this recently, with the idea that I could automatically take audiobook media files and their accompanying ebook representation and use them to automatically re-divide the file by chapter (or using something based on chapter). Not sure if this will work well for that (or if my use is considered "common"), but I'm certainly glad to see it.

I have used aeneas myself to do it, with mixed results. You will probably need to increase the DTW margin. Also note that you will need a lot of RAM --- say 16 GB if you plan to work on a single audio file with duration 10-15 hours, which is typical for an audiobook.

In theory one can perform the DTW out-of-core, saving the accumulated cost matrix and path to disk, but I have not had time to implement this yet (i.e., for now the accumulated, reduced DTW cost matrix must fit into RAM). I have tested that it can be done with PyTables, but it will probably only come with the next major version of aeneas (v2).
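To make the margin/memory trade-off concrete, here is a minimal sketch of DTW restricted to a stripe around the diagonal. It is a generic textbook-style banded DTW, not the aeneas implementation; the function name, the `margin` parameter (half-width of the stripe, in frames) and the use of plain numbers instead of MFCC vectors are all assumptions made for illustration. The returned cell count is the number of accumulated-cost entries you would have to keep around to backtrack the path, which is what grows with both the audio length and the margin.

```python
import math


def banded_dtw(a, b, margin):
    """Return (cost, cells) for DTW between sequences a and b.

    Only cells within `margin` columns of the (rescaled) diagonal are
    evaluated. `cells` counts the evaluated entries, i.e. the size of the
    stripe that would need to be stored to recover the alignment path.
    """
    n, m = len(a), len(b)
    inf = math.inf
    prev = [inf] * (m + 1)
    prev[0] = 0.0  # empty prefix aligned with empty prefix
    cells = 0
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        center = round(i * m / n)        # diagonal position for row i
        lo = max(1, center - margin)
        hi = min(m, center + margin)
        for j in range(lo, hi + 1):
            dist = abs(a[i - 1] - b[j - 1])
            cur[j] = dist + min(prev[j], cur[j - 1], prev[j - 1])
            cells += 1
        prev = cur
    return prev[m], cells


if __name__ == "__main__":
    x = [0, 1, 2, 3, 4, 5, 6, 7]
    y = [0, 0, 0, 1, 2, 3, 4, 5, 6, 7]
    for margin in (1, 3, 10):
        print(margin, banded_dtw(x, y, margin))
```

A wider margin can only lower (or keep) the cost, because the search space grows, but the stripe to store grows with it. That is why a long file plus a larger margin needs so much RAM.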

BTW, if your goal is to split, say, chapters of an audiobook, there are probably more efficient ways of doing this. For example, finding the long silence intervals between chapters might be enough. Or, instead of aligning all the text against all the audio, just perform a "partial matching" of the first sentences of each chapter against the audio.
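A rough sketch of the "long silence" idea, assuming the audio has already been decoded into a list of mono samples scaled to [-1, 1]. The function name, the 50 ms frame size, the RMS threshold and the 2 second minimum are hypothetical values picked for illustration, not anything aeneas uses; real audiobooks would likely need tuning per recording.

```python
def find_long_silences(samples, sample_rate, frame_s=0.05,
                       threshold=0.01, min_silence_s=2.0):
    """Return a list of (start_s, end_s) intervals of sustained low energy.

    The signal is cut into fixed-size frames; a frame is "silent" when its
    RMS is below `threshold`. Runs of silent frames lasting at least
    `min_silence_s` seconds are reported as candidate chapter breaks.
    """
    frame_len = max(1, round(sample_rate * frame_s))
    n_frames = len(samples) // frame_len
    silences = []
    run_start = None  # index of the first frame of the current silent run

    def close_run(end_frame):
        start_s = run_start * frame_len / sample_rate
        end_s = end_frame * frame_len / sample_rate
        if end_s - start_s >= min_silence_s:
            silences.append((start_s, end_s))

    for k in range(n_frames):
        frame = samples[k * frame_len:(k + 1) * frame_len]
        rms = (sum(x * x for x in frame) / frame_len) ** 0.5
        if rms < threshold:
            if run_start is None:
                run_start = k
        elif run_start is not None:
            close_run(k)
            run_start = None
    if run_start is not None:  # silence running to the end of the file
        close_run(n_frames)
    return silences


if __name__ == "__main__":
    loud = [0.5, -0.5] * 150   # 3 s of "speech" at 100 Hz
    quiet = [0.0] * 300        # 3 s of silence
    blip = [0.0] * 50          # 0.5 s pause, too short to count
    signal = loud + quiet + loud + blip + loud
    print(find_long_silences(signal, sample_rate=100, frame_s=0.1))
```

The candidate intervals could then be checked against the text, for example by aligning only the first sentences of each chapter against the audio that follows each break.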

Yes, thanks for the feedback! I wasn't planning on feeding it the entire audiobook and trying to align the whole thing (though there are other reasons you might want to do something like this) - I figured I'd use some heuristic methods to detect chapter breaks (like long silences), then try, as you say, partial matching to figure out which ones correspond to what chapters (or which ones correspond to chapters at all). Like I said, it was a vague plan, but when I've played around with running things through speech-to-text in the past I haven't had excellent results. I was hoping something like this (where you have the speech and the text and just want to know how they line up) would end up being much more accurate.
You are welcome.

Using a forced aligner usually improves the results a lot when compared to using an automatic speech recognition system --- because adapting the language model to your specific text prunes a lot of choices w.r.t. a generic language model which is supposed to cover any kind of text in that given language.

Anyway, if you feed aeneas an audio file < 2 hours, 4 GB of RAM should suffice, and the default parameters should be good as well. If you just need to recognize the splits, doing a full alignment is overkill, but I guess you will be happy to "waste" 5 minutes of computation time instead of spending more time implementing your own code.