To expand a little on this: The Echo has a 7-microphone array which is crucial to speech recognition accuracy. This gives it the best far-field recognition ability of any consumer product I've seen, with the ability to stay accurate even if you're across the room, with music playing. That's just the hardware, and replicating it's abilities will not be easy.
On the software side, supposedly they're using Nuance for recognition. Nuance isn't cutting edge: In the tests I've done, Nuance has a Word Error Rate (WER) that's 10%-20% higher than Google's, but it's still much better than something like Pocketsphinx or any other open source recognizer.
There are a lot of factors that go into making a speech interface a good experience for users: Good recognition accuracy even with background noise, good voice activity detection (even with background noise), very accurate word spotting, low latency. It's hard to hit all these things well enough to make the interface usable.
On the software side, supposedly they're using Nuance for recognition. Nuance isn't cutting edge: In the tests I've done, Nuance has a Word Error Rate (WER) that's 10%-20% higher than Google's, but it's still much better than something like Pocketsphinx or any other open source recognizer.
There are a lot of factors that go into making a speech interface a good experience for users: Good recognition accuracy even with background noise, good voice activity detection (even with background noise), very accurate word spotting, low latency. It's hard to hit all these things well enough to make the interface usable.