by adeptima
459 days ago
Accurate word timestamps seem to add overhead and require post-processing such as forced alignment (a speech technique that automatically aligns audio files with transcripts). I recently took a dive into forced alignment and discovered that most newer models don't operate on word or phoneme boundaries, but rather chunk the audio with overlap and match words in context. Older HMM-style models have shorter strides (10ms vs 20ms). I tried searching the Kaldi/Sherpa ecosystem, and most of what I found leads nowhere or to very small and inaccurate models. I'd appreciate any tips on the subject.
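To make the stride point concrete, here is a rough sketch of how a per-frame alignment turns into word timestamps, and why a 10ms stride gives finer boundaries than a 20ms one. The function name `word_times`, the `None`-for-blank convention, and the toy label sequence are all assumptions made for illustration; they are not taken from Kaldi, Sherpa, or any other toolkit.

```python
from itertools import groupby

def word_times(frame_labels, stride_ms):
    """Collapse per-frame word labels into (word, start_s, end_s) spans.

    frame_labels: one label per acoustic frame; None marks blank/silence.
    stride_ms: frame stride in milliseconds (e.g. 10 or 20).
    Hypothetical helper: adjacent repeats of the same word are merged,
    so a real aligner would need word indices rather than bare strings.
    """
    spans = []
    frame = 0
    for label, run in groupby(frame_labels):
        n = len(list(run))  # number of consecutive frames with this label
        if label is not None:
            start_s = frame * stride_ms / 1000
            end_s = (frame + n) * stride_ms / 1000
            spans.append((label, start_s, end_s))
        frame += n
    return spans

# Toy alignment: 7 frames, two words separated by a blank frame.
labels = [None, "hello", "hello", "hello", None, "world", "world"]
print(word_times(labels, 20))  # boundaries can only fall on 20ms steps
print(word_times(labels, 10))  # same frames read at a 10ms stride
```

Every boundary is a multiple of the stride, so the stride sets the best-case timestamp resolution regardless of how good the alignment itself is.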