> Of course, OCR and speech recognition are somewhat different fields, but if anyone is an expert for Chinese language, then it should be Baidu.

It's not about being experts in Mandarin. The whole point of their approach is that it avoids hand-engineered, expert-designed components: it's an end-to-end deep learning system. From the article:

> Our system is different than that system in that it’s more what we call end-to-end. Rather than having a lot of human-engineered components that have been developed over decades of speech research — by looking at the system and saying what features are important or which phonemes the model should predict — we just have some input data, which is an audio .WAV file on which we do very little pre-processing. And then we have a big, deep neural network that outputs directly to characters. We give it enough data that it’s able to learn what’s relevant from the input to correctly transcribe the output, with as little human intervention as possible.

> One thing that’s pleasantly surprising to us is that we had to do very little changing to it — other than scaling it and giving it the right data — to make this system we showed in December that worked really well on English work remarkably well in Chinese, as well.

So they were able to quickly train their Deep Speech engine [1] to transcribe Mandarin after first training it on English, changing little beyond scale and training data, and without injecting language-specific expertise into the engine.
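
To make "outputs directly to characters" a bit more concrete, here's a toy sketch of greedy CTC-style decoding. As far as I understand it, the Deep Speech paper [1] trains with a CTC loss, but this is not Baidu's code: the function name, the `_` blank symbol, and the hand-written per-frame probabilities are all made up for illustration, and a real system would decode with a language model rather than a plain argmax.

```python
from itertools import groupby

BLANK = "_"  # hypothetical symbol for the CTC "blank" (no character) output


def greedy_ctc_decode(frame_probs, alphabet):
    """Turn per-frame character probabilities into a transcript.

    frame_probs: one list of probabilities per audio frame, aligned with alphabet.
    alphabet: list of output symbols, including BLANK.
    """
    # Pick the most likely symbol for every frame.
    best = [alphabet[max(range(len(p)), key=p.__getitem__)] for p in frame_probs]
    # Collapse runs of the same symbol, then drop blanks.
    collapsed = [symbol for symbol, _ in groupby(best)]
    return "".join(symbol for symbol in collapsed if symbol != BLANK)


# English-style alphabet: frames read a, a, blank, a, b, b.
english = ["_", "a", "b"]
frames_en = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.1, 0.7],
]
print(greedy_ctc_decode(frames_en, english))  # aab

# Same decoder with a different output alphabet: only the data changes.
mandarin = ["_", "你", "好"]
frames_zh = [
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.1, 0.8],
]
print(greedy_ctc_decode(frames_zh, mandarin))  # 你好
```

Nothing in the decoder knows which language it's handling. The output alphabet and the training data carry all of that, which is why moving from English to Mandarin needs so little redesign.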

Finally, I strongly doubt the OCR and speech recognition teams are the same. I don't know about the OCR team, but their speech recognition team is based in California [2] and includes Andrew Ng and Awni Hannun from Stanford University.

[1] http://arxiv.org/abs/1412.5567

[2] http://usa.baidu.com/deep-speech-lessons-from-deep-learning/