Hacker News
by alpe 3383 days ago
For sure aeneas is not suitable, since it requires all the text and all the audio in advance.

ASR-based tools would, in theory, allow such a mode of operation, but I have not seen aligners that read from the mic buffer directly or that have a built-in option/CLI for it.

Knowing the text in advance basically means that you can train your own language (textual) model adapted to that exact text, then use the standard acoustic model for your language and run the alignment procedure as usual. Hence, I am quite sure you can tweak e.g. CMU Sphinx or Kaldi to do it. Perhaps gentle (which is based on Kaldi) is worth looking into.
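
To make the "language model adapted to that exact text" idea concrete, here is a minimal sketch of a bigram model estimated from the known transcript alone. It is only an illustration of the principle: the function names and the sample transcript are made up for this example, and it does not use the CMU Sphinx, Kaldi, or gentle APIs, which have their own model formats and tooling.

```python
from collections import Counter, defaultdict
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_bigram_model(text):
    """Estimate P(next_word | previous_word) from the transcript only.

    Hypothetical helper for illustration; real toolkits would need the
    model exported in their own format (e.g. ARPA or an FST).
    """
    tokens = ["<s>"] + tokenize(text) + ["</s>"]
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1

    model = {}
    for prev, followers in counts.items():
        total = sum(followers.values())
        model[prev] = {word: c / total for word, c in followers.items()}
    return model


# Made-up transcript standing in for the text known in advance.
transcript = "the quick brown fox jumps over the lazy dog"
lm = build_bigram_model(transcript)
print(lm["the"])
```

Because the model only assigns probability to word sequences that occur in the transcript, a decoder using it is strongly biased toward recognizing exactly that text, which is what makes the subsequent alignment step tractable.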

1 comment

I looked into gentle a few weeks ago and did notice that it seems to use an online algorithm. Unfortunately, it doesn’t have built-in support for live audio input, but it may be tweakable as you say (for example, by reimplementing its input handling to use audio streams that work with either static or real-time input). I guess there’s no other way to find out than to just try it myself.
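
A rough sketch of what "audio streams that work with either static or real-time input" could look like: a generator that pulls fixed-size chunks from any binary file-like object, so the consumer does not care whether the bytes come from a file on disk or from a live pipe. The name `iter_chunks` and the chunk size are assumptions for this example; this is not how gentle reads audio today.

```python
import io


def iter_chunks(stream, chunk_bytes=3200):
    """Yield byte chunks from any binary file-like object.

    Assumes raw 16 kHz, 16-bit mono PCM, where 3200 bytes is 100 ms.
    The last chunk may be shorter than chunk_bytes.
    """
    while True:
        chunk = stream.read(chunk_bytes)
        if not chunk:  # empty read means end of stream
            break
        yield chunk


# Static input: an in-memory buffer standing in for an audio file.
static_source = io.BytesIO(b"\x00" * 8000)
sizes = [len(chunk) for chunk in iter_chunks(static_source)]
print(sizes)
```

For live input, the same generator could be handed `sys.stdin.buffer` with a recording tool piping raw PCM into the process; each chunk would then be passed to an online decoder as it arrives.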