by hftf 3382 days ago
Do you know of any existing forced alignment tools that work well with live audio (microphone) input? I would like to create a live stream in which the words of a known text are displayed as they are being spoken into a microphone.
2 comments

aeneas is definitely not suitable, since it requires the full text and the full audio in advance.

ASR-based tools would, in theory, allow such a mode of operation, but I have not seen aligners that read from the mic buffer directly or offer a built-in option/CLI for it.

Knowing the text in advance basically means that you can train your own language (textual) model adapted to that exact text, and then use the standard acoustic model for your language and the usual alignment procedure. Hence, I am quite sure you can tweak e.g. CMU Sphinx or Kaldi to do it. Perhaps gentle (which is based on Kaldi) is worth looking into.
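
To illustrate what "a language model adapted to that exact text" means, here is a minimal sketch of a bigram model estimated from the known text alone, with add-one smoothing. The function names (`tokenize`, `train_bigram_model`) are made up for this example, and this is not the model format that CMU Sphinx or Kaldi consume; it only shows why such a model strongly prefers the word sequences of the text being read.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train_bigram_model(text):
    """Return a function prob(prev, nxt) estimated from `text` only.

    Uses add-one (Laplace) smoothing, so unseen word pairs still get
    a small non-zero probability.
    """
    words = ["<s>"] + tokenize(text) + ["</s>"]
    vocab = set(words)
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1

    def prob(prev, nxt):
        total = sum(counts[prev].values())
        return (counts[prev][nxt] + 1) / (total + len(vocab))

    return prob


prob = train_bigram_model("call me ishmael")
# The pair that occurs in the text is more likely than one that does not.
print(prob("call", "me") > prob("call", "ishmael"))
```

A real setup would export the model in whatever format the chosen toolkit expects and would probably use a better smoothing scheme; the point here is only that the known text sharply narrows what the recognizer should expect next.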

I looked into gentle a few weeks ago and did notice that it seems to use an online algorithm. It doesn’t have built-in support for live audio input unfortunately, but it may be tweakable as you say (such as reimplementing it to use audio streams that work with either static or real-time input). I guess there’s no other way to find out than to just try it myself.

Another possibility is to just run an automatic speech recognition system (e.g. Sphinx or PocketSphinx can read from the mic input), and align its output with the ground truth text.

You will need to handle imperfect matches, because the ASR output may differ slightly from the ground truth. But if you only want to chunk at, say, sentence granularity (and then move on to the next sentence), you should be able to do it in real time.
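
A rough sketch of that idea, assuming the ASR system hands you recognized words in small batches. Everything here is hypothetical and independent of any particular recognizer: the `SentenceTracker` class, the naive sentence splitter, and the 0.7 threshold are illustrative choices, not tested settings. The fuzzy matching uses `difflib.SequenceMatcher` from the Python standard library, so a few misrecognized words do not block progress.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and split into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def split_sentences(text):
    """Naive splitter on ., ! and ? (an assumption; real text needs more care)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


class SentenceTracker:
    """Tracks which sentence of a known text is currently being spoken."""

    def __init__(self, text, threshold=0.7):
        self.sentences = split_sentences(text)
        self.words = [normalize(s) for s in self.sentences]
        self.threshold = threshold  # fraction of a sentence's words that must match
        self.current = 0            # index of the sentence being spoken
        self.heard = []             # ASR words not yet assigned to a sentence

    def feed(self, asr_words):
        """Add newly recognized words; return the sentences they complete."""
        completed = []
        self.heard.extend(normalize(" ".join(asr_words)))
        while self.current < len(self.sentences):
            target = self.words[self.current]
            matcher = SequenceMatcher(None, target, self.heard, autojunk=False)
            blocks = [b for b in matcher.get_matching_blocks() if b.size]
            matched = sum(b.size for b in blocks)
            if target and matched / len(target) < self.threshold:
                break  # not enough of this sentence heard yet
            # Discard heard words up to the end of the last matching run.
            consumed = blocks[-1].b + blocks[-1].size if blocks else 0
            self.heard = self.heard[consumed:]
            completed.append(self.sentences[self.current])
            self.current += 1
        return completed


tracker = SentenceTracker("The quick brown fox jumps. It lands on the lazy dog.")
tracker.feed(["the", "quick"])                 # too little heard so far
print(tracker.feed(["brow", "fox", "jumps"]))  # one misrecognized word is tolerated
```

In a live setup, `feed` would be called each time the recognizer emits a partial result, and the returned sentences would drive the display. A threshold below 1.0 means a sentence can be marked complete slightly before its last words are spoken; the leftover words are then treated as noise when matching the next sentence.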