by hftf
3382 days ago
Do you know of any existing forced alignment tools that work well with live audio (microphone) input? I would like to create a live stream in which the words of a known text are displayed as they are being spoken into a microphone.
ASR-based tools would, in theory, allow such an operation mode, but I have not seen aligners that read from the microphone buffer directly or that offer a built-in option or CLI for it.
Knowing the text in advance basically means that you can train your own language (textual) model adapted to that exact text, then use the standard acoustic model for your language and run the alignment procedure as usual. Hence, I am quite sure you can tweak e.g. CMU Sphinx or Kaldi to do it. Perhaps gentle (which is based on Kaldi) is worth looking into.
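To illustrate what "a language model adapted to that exact text" might look like, here is a minimal sketch in plain Python: a bigram model with add-one smoothing trained only on the known text. The function names (`tokenize`, `train_bigram_lm`) and the sample sentence are my own, purely for illustration; this is not the format CMU Sphinx, Kaldi, or gentle consume, and a real setup would need to export the model in whatever format the chosen toolkit expects.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train_bigram_lm(text):
    """Build an add-one-smoothed bigram model from a single known text.

    Returns a function prob(prev, word) and the vocabulary.
    Hypothetical helper for illustration only.
    """
    words = ["<s>"] + tokenize(text) + ["</s>"]
    vocab = set(words)
    # Count how often each word appears as a context (everything but the end marker).
    contexts = Counter(words[:-1])
    # Count adjacent word pairs as they occur in the known text.
    bigrams = Counter(zip(words, words[1:]))

    def prob(prev, word):
        # Add-one smoothing keeps unseen pairs possible but unlikely.
        return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))

    return prob, vocab


known_text = "The quick brown fox jumps over the lazy dog"
prob, vocab = train_bigram_lm(known_text)

# A word pair that occurs in the text scores higher than one that does not.
print(prob("quick", "brown") > prob("quick", "dog"))
```

The point is only that such a model strongly prefers the word order of the known text, which is what lets a recognizer follow the speaker through it; the acoustic model stays the standard one for the language.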