by cmicali 5110 days ago
Vlingo, Siri, and others have been doing speaker-independent, auto-adapting speech recognition for years, so all the talk about systems requiring 'training', and about improvements there, makes this article sound like it is 5 years old. Great to see innovation in this space, but this article is very light on detail.
3 comments

It is my understanding (albeit based on limited knowledge) that Siri, like other Nuance-powered systems that make a call to the server, is actually "trained" continuously on the huge amount of sample speech it receives from real users.

The true "breakthrough" here would be if Microsoft made a voice recognition system that could run entirely on a device (no internet connection needed) and accurately understand speech without terabytes of training data or a local user training session. I can't tell from the article if this is what Microsoft is claiming.

Also, it appears that "Deep Neural Network" isn't the most common term of art here. DNN appears to be a synonym for "Deep Belief Network".[1] Can anyone confirm?

[1] http://www.scholarpedia.org/article/Deep_belief_networks

I believe that in this system, "deep neural network" just means a regular feed-forward network that has a larger number of hidden layers. There is a relationship to DBNs though, because they initialize the weights of the neural net by doing unsupervised pre-training with a set of DBNs.
The term "Deep Belief Network" has been abused in the literature (not pointing fingers, I've done it too). The DNNs used here are neural nets pre-trained with RBMs. Sometimes, when people say DBN, that is also what they mean. But really a DBN is a particular graphical model with undirected connections between the top two layers and directed connections everywhere else. The confusion comes from the pre-training procedure. The pre-training creates a DBN, which is then used to initialize the weights of a standard feedforward neural net. Then the DBN is discarded. It is a somewhat pedantic distinction. Since DBN is already an overloaded acronym in the speech community (it also stands for Dynamic Bayes Net) and is not entirely accurate for the pedantic reason I just mentioned, we decided to go with the DNN acronym.
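For anyone who wants the procedure above in concrete form, here is a minimal sketch of greedy layer-wise pre-training with RBMs, followed by reuse of the learned weights in a plain feed-forward pass. It is an illustration under loud assumptions, not the system from the article: the class and function names (`RBM`, `pretrain`, `forward`), the toy binary data, the layer sizes, and the use of one-step contrastive divergence (CD-1) are all my own choices, and the supervised fine-tuning step that would normally follow is left out.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample(probs, rng):
    # Draw binary states from independent Bernoulli probabilities.
    return [1.0 if rng.random() < p else 0.0 for p in probs]

class RBM:
    """Tiny binary RBM trained with CD-1 (an assumption for this sketch)."""

    def __init__(self, n_visible, n_hidden, rng):
        self.rng = rng
        self.W = [[rng.gauss(0.0, 0.1) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.b_vis = [0.0] * n_visible
        self.b_hid = [0.0] * n_hidden

    def hidden_probs(self, v):
        return [sigmoid(self.b_hid[j] + sum(v[i] * self.W[i][j] for i in range(len(v))))
                for j in range(len(self.b_hid))]

    def visible_probs(self, h):
        return [sigmoid(self.b_vis[i] + sum(h[j] * self.W[i][j] for j in range(len(h))))
                for i in range(len(self.b_vis))]

    def cd1_update(self, v0, lr=0.1):
        # Positive phase on the data, negative phase on a one-step reconstruction.
        h0 = self.hidden_probs(v0)
        v1 = self.visible_probs(sample(h0, self.rng))
        h1 = self.hidden_probs(v1)
        for i in range(len(v0)):
            for j in range(len(h0)):
                self.W[i][j] += lr * (v0[i] * h0[j] - v1[i] * h1[j])
            self.b_vis[i] += lr * (v0[i] - v1[i])
        for j in range(len(h0)):
            self.b_hid[j] += lr * (h0[j] - h1[j])

def pretrain(data, layer_sizes, epochs=5, seed=0):
    """Train one RBM per layer, each on the hidden activations of the one below.

    Only the weights and hidden biases are returned; the generative parts
    (visible biases, the stacked model itself) are thrown away.
    """
    rng = random.Random(seed)
    layers, inputs, n_in = [], data, len(data[0])
    for n_hidden in layer_sizes:
        rbm = RBM(n_in, n_hidden, rng)
        for _ in range(epochs):
            for v in inputs:
                rbm.cd1_update(v)
        layers.append((rbm.W, rbm.b_hid))
        inputs = [rbm.hidden_probs(v) for v in inputs]
        n_in = n_hidden
    return layers

def forward(layers, x):
    # Deterministic feed-forward pass using the pre-trained weights as the initialization.
    for W, b in layers:
        x = [sigmoid(b[j] + sum(x[i] * W[i][j] for i in range(len(x))))
             for j in range(len(b))]
    return x

# Toy binary data, purely illustrative.
data = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
layers = pretrain(data, layer_sizes=[3, 2])
print(forward(layers, data[0]))
```

In a real recognizer you would add an output layer on top of `layers` and fine-tune everything with backpropagation on labeled data; the sketch stops at the initialization step because that is the part the DBN/DNN naming confusion is about.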
As you might guess, they are not claiming this.

They are basically using a technique that is new in the context of speech rec and that seems to improve accuracy by 16% relative on their test data (and using their code :-)). It's a really great result, but it doesn't change the basic nature of a state-of-the-art speech recognizer at all: you still need to train and adapt it, and it still needs lots and lots of data.
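To make "16% relative" concrete, here is a quick arithmetic sketch. I am assuming the figure means a relative reduction in error rate, and the 25% baseline is a made-up number for illustration only; it is not from the article, and `relative_reduction` is just a helper name I picked.

```python
def relative_reduction(baseline_err, new_err):
    # Fraction of the baseline error that was removed.
    return (baseline_err - new_err) / baseline_err

baseline_err = 0.25                    # hypothetical baseline error rate
new_err = baseline_err * (1 - 0.16)    # apply a 16% relative reduction
absolute_drop = baseline_err - new_err

print(f"new error rate:     {new_err:.4f}")
print(f"absolute reduction: {absolute_drop:.4f}")
print(f"relative reduction: {relative_reduction(baseline_err, new_err):.2f}")
```

Under that made-up baseline the error rate falls by 4 percentage points, not 16, which is why a relative figure on its own says little about how the recognizer feels in use.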

Vlingo has LITERALLY never gotten anything I said right, ever. Just a data point.
That jumped out at me as well. Speaker-independent systems are most certainly not limited to small vocabularies or pre-baked input patterns anymore. There is certainly room for a great deal of improvement, but it's in accuracy, not simply the ability to do generalized speaker-independent input at all.