by alpe 3381 days ago
You are welcome.

Using a forced aligner usually improves the results a lot compared to using an automatic speech recognition system, because adapting the language model to your specific text prunes many of the choices that a generic language model, which is supposed to cover any kind of text in that language, has to keep open.

Anyway, if you feed aeneas an audio file shorter than 2 hours, 4 GB of RAM should suffice, and the default parameters should be fine as well. If you just need to recognize the splits, doing a full alignment is overkill, but I guess you will be happy to "waste" 5 minutes of computation time instead of spending more time implementing your own code.
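
If the splits are all you need, they can be read straight off the alignment output. Here is a minimal sketch, assuming the sync map was written as JSON and has a top-level "fragments" list whose items carry "begin" and "end" times in seconds as strings. The sample data and the `split_points` helper are made up for illustration; check the layout against the file you actually get.

```python
import json

# Hypothetical sample shaped like a JSON sync map. Assumed layout: a
# "fragments" list whose items hold "begin" and "end" in seconds, as strings.
SAMPLE_SYNC_MAP = """
{"fragments": [
  {"id": "f000001", "begin": "0.000", "end": "2.680", "lines": ["First sentence."]},
  {"id": "f000002", "begin": "2.680", "end": "5.120", "lines": ["Second sentence."]},
  {"id": "f000003", "begin": "5.120", "end": "9.400", "lines": ["Third sentence."]}
]}
"""


def split_points(sync_map_text):
    """Return the boundaries (in seconds) between consecutive fragments."""
    fragments = json.loads(sync_map_text)["fragments"]
    # The end of every fragment except the last one is a split point.
    return [float(fragment["end"]) for fragment in fragments[:-1]]


print(split_points(SAMPLE_SYNC_MAP))  # prints [2.68, 5.12]
```

For a real file, pass its contents instead of the sample, e.g. `split_points(open("map.json").read())`, where `map.json` is a placeholder name.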