You'd imagine so, but as someone who's tried: voice recognition is just one part of it, and it's a rather hard problem in terms of the computing power required. Note that the linked DeepSpeech stuff is a TensorFlow-based solution, and it only hits an xRT of 0.44 on a GTX 1070. While that's slightly better than twice realtime, I really doubt anything less powerful than a GPU that's a handful of years old is going to pull it off in realtime, and definitely not an RPi or similar.
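For anyone unsure how to read that xRT number, here's a quick sketch of the arithmetic. I'm assuming the usual definition of real-time factor (processing time divided by audio duration), and the function names are just made up for illustration:

```python
def real_time_factor(processing_seconds, audio_seconds):
    """xRT: how long recognition takes relative to the length of the audio.

    Below 1.0 means faster than realtime, above 1.0 means it falls behind.
    """
    return processing_seconds / audio_seconds


def speedup_over_realtime(xrt):
    """How many times faster than realtime a given xRT works out to."""
    return 1.0 / xrt


def processing_time(xrt, audio_seconds):
    """Seconds of compute needed for a clip of the given length."""
    return xrt * audio_seconds


xrt = 0.44  # the GTX 1070 figure quoted above
print(f"{speedup_over_realtime(xrt):.2f}x realtime")
print(f"{processing_time(xrt, 10.0):.1f} s to process a 10 s clip")
```

So at 0.44 a 10 second clip takes about 4.4 seconds on that GPU. Hardware that is more than roughly 2.3x slower ends up above an xRT of 1.0 and can't keep up with live audio.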
Edit: follow-up clarification: the Echo is NOT doing voice processing on the device itself; it ships the audio up to the cloud for that. You could of course set up something similar using a Raspberry Pi that ships audio to your desktop to be processed.
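A rough sketch of what that Pi-to-desktop setup could look like, using only Python's standard library. This is an assumption-heavy illustration and not how the Echo actually does it: the length-prefixed framing, the function names, and the `transcribe` callback (where you'd plug in whatever recognizer runs on the desktop) are all hypothetical.

```python
import socket
import struct


def _recv_exact(conn, n):
    """Read exactly n bytes from the socket, or raise if the peer hangs up."""
    buf = bytearray()
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed before sending all data")
        buf.extend(chunk)
    return bytes(buf)


def send_audio(host, port, pcm_bytes):
    """Pi side: send one recorded clip, wait for the transcript text."""
    with socket.create_connection((host, port)) as conn:
        # 4-byte big-endian length prefix, then the raw audio bytes.
        conn.sendall(struct.pack("!I", len(pcm_bytes)) + pcm_bytes)
        (reply_len,) = struct.unpack("!I", _recv_exact(conn, 4))
        return _recv_exact(conn, reply_len).decode("utf-8")


def handle_one_clip(server_sock, transcribe):
    """Desktop side: accept one connection, transcribe it, reply with text.

    `transcribe` is a placeholder callable (bytes -> str) standing in for
    the actual speech recognizer running on the more powerful machine.
    """
    conn, _addr = server_sock.accept()
    with conn:
        (clip_len,) = struct.unpack("!I", _recv_exact(conn, 4))
        pcm_bytes = _recv_exact(conn, clip_len)
        text = transcribe(pcm_bytes).encode("utf-8")
        conn.sendall(struct.pack("!I", len(text)) + text)
```

On the desktop you'd create a listening socket with `socket.create_server(("0.0.0.0", port))` and call `handle_one_clip` in a loop; the Pi only has to record audio after the wake word and call `send_audio`. The heavy lifting stays on the machine with the GPU.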
Kinda. The issue is the microphone. It's a specialized piece of hardware that uses a microphone array and also has wake word logic built in. Only a couple of companies sell these, and they don't sell one-offs. But maybe in the future.
Wake word logic is not the hard part; you can do it in software relatively cheaply (cmusphinx/pocketsphinx handles wake-word-type logic in realtime with relatively low CPU requirements). Microphone quality is an issue, but there's stuff out there; it's just a matter of finding a good one. One popular option is the PlayStation Eye camera, as it's really freaking cheap and is designed for speech input at medium distances, but it's not omnidirectional like the Echo/Google Home Mini kind of thing is.