There are so many toys like this out there. It is very simple to hardcode a few simple commands in a Python script and present that as an advance in artificial intelligence. Some examples are Simon, Vedics, Palaver and Voicekey.
I could add 10 others. Unfortunately, all such projects are practically unusable for a wide audience because they provide only basic functions and lack very important features. The problem is that advanced development requires an understanding of speech recognition internals, close cooperation with engine developers and rigorous user interface testing.
Here is the list of features one has to implement to provide even a basic user experience:
1. Implement and test proper keyword spotting for voice activation. This feature is missing in the Julius engine, and none of the workarounds in use provide enough accuracy.
2. Implement invisible speaker detection/online adaptation to deal with environment issues. With adaptation, accuracy is extremely high for a small number of speakers, and you can easily dictate free-form speech.
3. Implement free-form speech. This part requires data collection and natural language understanding work.
4. Implement a framework to detect microphone and noise issues. Microphone issues like clipping are a major source of accuracy problems in speech recognition. Most engines do not check whether the microphone is set up properly.
5. Provide an easy way to add new commands with no training and no knowledge of phonetics. This requires a G2P component that converts unknown words to phonemes, like the one provided by CMUSphinx.
6. Implement dialog management and error correction, so that the user can not just run queries but have a conversation with the application. Interestingly, that requires updating previous engine results and analyzing n-best lists or lattices.
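To make item 4 a bit more concrete, here is a minimal sketch of a clipping check on 16-bit PCM samples. The function names, the 0.99 margin and the 1% limit are assumptions chosen for illustration, not values taken from any engine; a real framework would also need to look at noise level and signal-to-noise ratio.

```python
import math

FULL_SCALE = 32767  # largest positive value of a signed 16-bit sample


def clipping_ratio(samples, margin=0.99):
    """Return the fraction of samples at or near full scale.

    `samples` is a sequence of signed 16-bit integers. `margin` is an
    assumed tolerance: anything at or above 99% of full scale counts as clipped.
    """
    if not samples:
        return 0.0
    threshold = FULL_SCALE * margin
    clipped = sum(1 for s in samples if abs(s) >= threshold)
    return clipped / len(samples)


def is_clipped(samples, max_ratio=0.01):
    """Flag a buffer when more than `max_ratio` of it is clipped (assumed limit)."""
    return clipping_ratio(samples) > max_ratio


def sine_int16(amplitude, n=1600, freq=440.0, rate=16000):
    """Synthetic test tone, hard-limited to the 16-bit range like an overdriven mic."""
    out = []
    for i in range(n):
        value = amplitude * math.sin(2 * math.pi * freq * i / rate)
        out.append(int(max(-32768, min(32767, value))))
    return out


if __name__ == "__main__":
    print("quiet tone clipped:", is_clipped(sine_int16(10000)))
    print("overdriven tone clipped:", is_clipped(sine_int16(60000)))
```

A check like this could run on every captured buffer before it reaches the recognizer, so the user gets a "lower your microphone gain" message instead of silently bad results.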
So I hope someone will start doing real work on speech interfaces in close cooperation with engine developers. That would create amazing things to demonstrate.
What I actually want to build is multi-backend. It's running on CMUSphinx right now (which was a ridiculous pain to build, but also taught me a lot about the things that power it), but I want the audio backends to be switchable by changing a few strings in a config file, and of course I want it to spread to more platforms.
The output, initialization and running of Julius and CMUSphinx are very similar.
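Switching backends through a config string could look roughly like the sketch below. Everything here is hypothetical: the `[recognizer]` section, the `backend` key, and the two wrapper classes are placeholders for illustration and do not call the real Julius or CMUSphinx APIs.

```python
import configparser

# Registry mapping the string used in the config file to a backend class.
BACKENDS = {}


def register(name):
    """Class decorator that adds a backend to the registry under `name`."""
    def wrap(cls):
        BACKENDS[name] = cls
        return cls
    return wrap


class Backend:
    """Common interface; real wrappers would start the engine in __init__."""

    def __init__(self, options):
        self.options = options

    def recognize(self, audio):
        raise NotImplementedError("placeholder: engine call goes here")


@register("cmusphinx")
class SphinxBackend(Backend):
    pass  # hypothetical wrapper, no real CMUSphinx calls


@register("julius")
class JuliusBackend(Backend):
    pass  # hypothetical wrapper, no real Julius calls


def load_backend(config_text):
    """Build the backend named in the (assumed) [recognizer] config section."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    section = parser["recognizer"]
    name = section["backend"]
    if name not in BACKENDS:
        raise ValueError("unknown backend: " + name)
    # Every other key in the section is passed through as an engine option.
    options = {k: v for k, v in section.items() if k != "backend"}
    return BACKENDS[name](options)


if __name__ == "__main__":
    cfg = "[recognizer]\nbackend = julius\nmodel = /path/to/model\n"
    print(type(load_backend(cfg)).__name__)
```

With this shape, moving from one engine to the other means editing only the `backend` line, and the rest of the application talks to the shared `recognize` interface.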
I do think CMUSphinx is pretty good as an always-on listener -- I've had times where it would suddenly die after running for hours (which I tried to debug) -- and times when it ran fine after I had played music non-stop for hours.
Links to the example projects:
Simon: https://sourceforge.net/projects/speech2text/
Vedics: http://sourceforge.net/projects/vedics/
Palaver: https://github.com/JamezQ/Palaver
Voicekey: http://sourceforge.net/projects/voicekey/