

by nshm 4772 days ago

There are so many toys like this out there. It's very simple to hardcode a few simple commands in a Python script and present that as an advance in artificial intelligence. Some examples are:

Simon https://sourceforge.net/projects/speech2text/
Vedics http://sourceforge.net/projects/vedics/
Palaver https://github.com/JamezQ/Palaver
Voicekey http://sourceforge.net/projects/voicekey/

I could add 10 others. Unfortunately, all such projects are practically unusable for a wide audience because they lack very important features and provide only basic functions. The problem is that advanced development requires an understanding of speech recognition internals, close cooperation with engine developers, and rigorous user interface testing.

Here is the list of features one has to implement to provide even a basic user experience:

1. Implement and test proper keyword spotting for voice activation. This feature is missing in the Julius engine, and none of the workarounds in use provide enough accuracy.

2. Implement invisible speaker detection/online adaptation to deal with environment issues. With adaptation, accuracy is extremely high for a few speakers, and you can easily dictate free-form speech.

3. Implement free-form speech. This part requires data collection and natural language understanding work.

4. Implement a framework to detect microphone and noise issues. Microphone issues like clipping are a major source of accuracy problems in speech recognition. Most engines pay no attention to whether the microphone is working properly.

5. Provide an easy way to add new commands with no training and no knowledge of phonetics. This requires a G2P (grapheme-to-phoneme) component that converts unknown words to phonemes, like the one provided by CMUSphinx.

6. Implement dialog management and error correction, so that the user can not just run queries but have a conversation with the application. Interestingly, that requires updating previous engine results and analyzing n-best lists or lattices.
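
To make item 6 a bit more concrete, here is a minimal sketch of error correction over an n-best list. It assumes the engine hands back a list of scored hypotheses; the `Hypothesis` class, the `pick_hypothesis` function, and the `bonus` weight are all made up for illustration and are not the API of Julius, CMUSphinx, or any other engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """One entry of a (hypothetical) n-best list returned by an engine."""
    text: str
    score: float  # assumed log-likelihood-style score; higher is better


def pick_hypothesis(nbest, rejected=frozenset(), expected_words=(), bonus=1.0):
    """Re-rank an n-best list using dialog context.

    rejected:       texts the user has already said "no" to
    expected_words: words the dialog state makes more likely
    bonus:          illustrative score boost per expected word
    """
    expected = set(expected_words)

    def rescored(hyp):
        hits = sum(1 for word in hyp.text.split() if word in expected)
        return hyp.score + bonus * hits

    candidates = [h for h in nbest if h.text not in rejected]
    if not candidates:
        return None  # nothing left, the application should ask again
    return max(candidates, key=rescored)


nbest = [
    Hypothesis("call bob", -10.0),
    Hypothesis("call rob", -10.5),
    Hypothesis("cold rob", -14.0),
]

first = pick_hypothesis(nbest)
# The user answers "no", so the earlier result is revised, not re-recorded.
second = pick_hypothesis(nbest, rejected={first.text})
# Alternatively, the dialog knows "rob" was mentioned a moment ago.
in_context = pick_hypothesis(nbest, expected_words={"rob"})
```

The point is that the application needs more than the single best string from the engine. A real system would work on lattices and proper language model scores rather than a flat word bonus.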

So I hope someone will start doing real work on speech interfaces in close cooperation with engine developers. That would create amazing things to demonstrate.
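
As a small illustration of item 4, clipping in particular is cheap to check for. The sketch below assumes 16-bit signed PCM samples already loaded into a list of integers; the function names and the 0.99 and 1% thresholds are arbitrary choices for the example, not values taken from any engine.

```python
import math


def clipping_ratio(samples, full_scale=32767, threshold=0.99):
    """Return the fraction of samples at or near full scale."""
    if not samples:
        return 0.0
    limit = threshold * full_scale
    clipped = sum(1 for s in samples if abs(s) >= limit)
    return clipped / len(samples)


def is_clipped(samples, max_ratio=0.01, **kwargs):
    """Flag a recording when more than max_ratio of it sits at full scale."""
    return clipping_ratio(samples, **kwargs) > max_ratio


def sine(amplitude, freq=440.0, rate=16000, seconds=1.0):
    """Synthetic 16-bit test tone, saturated the way an overdriven input would be."""
    count = int(rate * seconds)
    raw = (int(amplitude * math.sin(2 * math.pi * freq * n / rate)) for n in range(count))
    return [max(-32768, min(32767, s)) for s in raw]


quiet = sine(10000)  # well within range
loud = sine(60000)   # overdriven, so the peaks are flattened

print(is_clipped(quiet), is_clipped(loud))
```

A check like this, run before recognition starts, lets the application tell the user to lower the input gain instead of silently returning bad results.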