by breckinloggins 5104 days ago
It is my understanding (albeit based on limited knowledge) that Siri, like other Nuance-powered systems that make a call to the server, is actually "trained" continuously on the huge amount of sample speech it receives from real users.

The true "breakthrough" here would be if Microsoft made a voice recognition system that could run entirely on a device (no internet connection needed) and accurately understand speech without terabytes of training data or a local user training session. I can't tell from the article if this is what Microsoft is claiming.

Also, it appears that "Deep Neural Network" isn't the most common term of art here. DNN appears to be a synonym for "Deep Belief Network".[1] Can anyone confirm?

[1] http://www.scholarpedia.org/article/Deep_belief_networks

3 comments

I believe that in this system, "deep neural network" just means a regular feed-forward network that has a larger number of hidden layers. There is a relationship to DBNs though, because they initialize the weights of the neural net by doing unsupervised pre-training with a set of DBNs.

The term "Deep Belief Network" has been abused in the literature (not pointing fingers, I've done it too). The DNNs used here are neural nets pre-trained with RBMs. Sometimes, when people say DBN, that is also what they mean. But strictly speaking, a DBN is a particular graphical model with undirected connections between the top two layers and directed connections everywhere else. The confusion comes from the pre-training procedure: pre-training creates a DBN, which is then used to initialize the weights of a standard feed-forward neural net, and then the DBN is discarded. It is a somewhat pedantic distinction. Since DBN is already an overloaded acronym in the speech community (it also stands for Dynamic Bayes Net) and is not entirely accurate for the pedantic reason I just mentioned, we decided to go with the DNN acronym.

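To make the pre-train-then-discard idea concrete, here is a rough toy sketch in pure Python. It is not the code from the paper: it assumes binary RBMs trained with one-step contrastive divergence (CD-1), and every name in it (`RBM`, `pretrain_stack`, `init_feedforward`, `forward`) is made up for illustration. A real system would presumably add an output layer and fine-tune with backpropagation afterwards, which is left out here.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class RBM:
    """Toy binary RBM trained with CD-1 (an assumed training rule, for illustration)."""

    def __init__(self, n_visible, n_hidden, rng):
        self.rng = rng
        self.W = [[rng.gauss(0.0, 0.1) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.b_vis = [0.0] * n_visible
        self.b_hid = [0.0] * n_hidden

    def hidden_probs(self, v):
        return [sigmoid(self.b_hid[j] + sum(v[i] * self.W[i][j] for i in range(len(v))))
                for j in range(len(self.b_hid))]

    def visible_probs(self, h):
        return [sigmoid(self.b_vis[i] + sum(h[j] * self.W[i][j] for j in range(len(h))))
                for i in range(len(self.b_vis))]

    def cd1_step(self, v0, lr=0.1):
        # Positive phase: hidden activations driven by the data.
        h0 = self.hidden_probs(v0)
        h0_sample = [1.0 if self.rng.random() < p else 0.0 for p in h0]
        # Negative phase: one reconstruction step.
        v1 = self.visible_probs(h0_sample)
        h1 = self.hidden_probs(v1)
        for i in range(len(v0)):
            for j in range(len(h0)):
                self.W[i][j] += lr * (v0[i] * h0[j] - v1[i] * h1[j])
            self.b_vis[i] += lr * (v0[i] - v1[i])
        for j in range(len(h0)):
            self.b_hid[j] += lr * (h0[j] - h1[j])


def pretrain_stack(data, layer_sizes, epochs=5, seed=0):
    """Greedy layer-wise pre-training: each RBM is trained on the layer below's output."""
    rng = random.Random(seed)
    rbms, layer_input = [], data
    for n_hidden in layer_sizes:
        rbm = RBM(len(layer_input[0]), n_hidden, rng)
        for _ in range(epochs):
            for v in layer_input:
                rbm.cd1_step(v)
        rbms.append(rbm)
        layer_input = [rbm.hidden_probs(v) for v in layer_input]
    return rbms


def init_feedforward(rbms):
    """Copy weights and hidden biases into plain layers; the RBM stack can then be dropped."""
    return [([row[:] for row in rbm.W], rbm.b_hid[:]) for rbm in rbms]


def forward(layers, x):
    """Ordinary deterministic feed-forward pass through sigmoid layers."""
    for W, b in layers:
        x = [sigmoid(b[j] + sum(x[i] * W[i][j] for i in range(len(x))))
             for j in range(len(b))]
    return x
```

Something like `net = init_feedforward(pretrain_stack(data, [3, 2]))` gives you a two-hidden-layer net whose starting weights came from the unsupervised stack. Only the weights and hidden biases survive; the visible biases and the generative model itself are thrown away, which is the "DBN is discarded" part.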
As you might guess, they are not claiming this.

They are basically using a technique that is new (in the context of speech recognition) and seems to improve accuracy by 16% relative on their test data (and using their code :-)). It's a really great result, but it doesn't change the basic nature of a state-of-the-art speech recognizer at all: you still need to train and adapt it, and it still needs lots and lots of data.
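
For anyone unsure what "16% relative" means in practice, here is a quick sketch. It assumes the figure is a relative reduction in error rate, and the 30% baseline is a made-up number, not something from the article; `apply_relative_reduction` is just an illustrative helper.

```python
def apply_relative_reduction(baseline_error, relative_reduction):
    """Return the error rate after cutting baseline_error by a relative fraction."""
    return baseline_error * (1.0 - relative_reduction)


baseline = 0.30  # hypothetical baseline error rate, NOT taken from the article
improved = apply_relative_reduction(baseline, 0.16)
print(f"{baseline:.1%} -> {improved:.1%} (absolute drop {baseline - improved:.1%})")
```

With that hypothetical baseline, a 16% relative improvement is only about a 4.8 point absolute drop, which is why the relative/absolute distinction matters when reading results like this.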