by daanzu 2203 days ago
Windows Speech Recognition is far from the best, so perhaps your trouble could be partly caused by how you had to speak in order to be understood, rather than the command style? I used to use WSR to code by voice, and it was far more laborious than my current setup.

I develop kaldi-active-grammar [0]. The Kaldi engine is state of the art for command and control. Although I don't have the data and resources to train a model the way Microsoft/Nuance/Google do, being an open rather than closed system allows me to train models that are far more personalized than the large commercial/generic ones you are used to. For example, see the video of me using it [1], where I can speak in a relaxed manner without having to over-enunciate and strain my voice.

Gathering the data for such training does take some time, but the results can be huge [2]. Performing the actual training is currently complicated; I am working on making it portable and more turnkey, but it's not ready yet. However, I am running test training for some people. Contact me if you want me to use you as a guinea pig.

[0] https://github.com/daanzu/kaldi-active-grammar

[1] https://youtu.be/Qk1mGbIJx3s

[2] https://github.com/daanzu/kaldi-active-grammar/blob/master/d...

2 comments

> Performing the actual training is currently complicated; I am working on making it portable and more turnkey, but it's not ready yet

I'm eagerly awaiting this! If I wanted to try to get something working now, I'd need to invest a lot of time - being able to get started quickly would be amazing.

It looks like Kaldi can use different backends, which I imagine have very different performance characteristics. Can you rank them from best to worst, with relative distances?

Just to be clear, the Dragonfly speech recognition command and control framework has multiple "backends" (speech recognition engines), including my Kaldi one. Probably the most used one currently is the Dragon NaturallySpeaking backend.

The Kaldi engine, being developed primarily for research in speech recognition, can support a huge variety of "models". I think the general consensus is that the best choice for most use cases (particularly real-time, low-latency, streaming use) is currently the "nnet3 chain" models, which are what my kaldi-active-grammar uses/supports.
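
To make the layering concrete, here is a rough sketch of how the pieces relate: the framework picks an engine (backend), and the engine in turn loads a model. This is only an illustration of the relationship, not Dragonfly's or Kaldi's actual API; every class name, method, and model name below is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    """Trained data that an engine loads (hypothetical stand-in, not a real Kaldi type)."""
    name: str
    kind: str  # e.g. "nnet3 chain"


class Engine:
    """A speech recognition engine ("backend"). It is not the model; it uses one."""

    def __init__(self, name: str, model: Model):
        self.name = name
        self.model = model

    def describe(self) -> str:
        return f"{self.name} engine -> {self.model.kind} model '{self.model.name}'"


class Framework:
    """Command-and-control layer (the role Dragonfly plays): it can sit on top of several engines."""

    def __init__(self):
        self._engines = {}

    def register(self, engine: Engine) -> None:
        self._engines[engine.name] = engine

    def get_engine(self, name: str) -> Engine:
        return self._engines[name]


# Illustrative setup only: the model names are made up.
framework = Framework()
framework.register(Engine("kaldi", Model("personal-model", "nnet3 chain")))
framework.register(Engine("dragon", Model("vendor-model", "vendor-supplied")))

print(framework.get_engine("kaldi").describe())
```

So in these terms, swapping the engine changes which recognizer does the work, while swapping or retraining the model changes how well a given engine recognizes a particular voice.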

Thank you, I think I understand partially, but not fully, as I'm not very well versed in speech recognition software.

Basically, my question (and I assume many other users') is "I run <Linux/Windows/Mac OS>, what are my options and how good will my recognition be with each?". Your answer above helps, but it doesn't entirely satisfy me, as I'm not sure if a model is the recognition engine, or if the engine uses the model, or how I can use it, etc.