The Echo has a separate processor that listens for the wake word; when it hears the wake word, it wakes the main processor to do the actual audio processing. Some simple commands are processed directly on the Echo without going to Amazon's servers, but the rest are sent over the Internet to be processed.
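For anyone who wants the flow above in concrete terms, here is a rough sketch in Python. It is only a toy model of the two stages as described, not Amazon's actual implementation: the wake word string, the set of "local" commands, and the function name `handle` are all made up for illustration, and real wake word detection works on audio, not text.

```python
# Toy model of the two-stage flow: a low-power detector gates everything,
# then the main processor decides whether to handle a command locally
# or send it to the cloud. All names and values here are hypothetical.

WAKE_WORD = "alexa"  # assumed wake word, for illustration only
LOCAL_COMMANDS = {"stop", "volume up", "volume down"}  # assumed examples


def handle(utterance: str) -> str:
    """Return 'ignored', 'local', or 'cloud' for a transcribed utterance."""
    text = utterance.lower().strip()

    # Stage 1: the always-on detector only checks for the wake word.
    # Anything else never reaches the main processor.
    if not text.startswith(WAKE_WORD):
        return "ignored"

    # Stage 2: the main processor is woken and looks at the command itself.
    command = text[len(WAKE_WORD):].strip(" ,")
    if command in LOCAL_COMMANDS:
        return "local"   # handled on the device
    return "cloud"       # sent over the Internet for processing


if __name__ == "__main__":
    for phrase in ["what time is it", "Alexa, stop", "Alexa, what's the weather"]:
        print(f"{phrase!r} -> {handle(phrase)}")
```

The point of the split is that nothing said before the wake word gets past the first stage.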
From my experience reverse engineering the first-generation Echo, there was no coprocessor. However, wake word detection was done offline. There was a software-controlled hardware switch to disconnect the microphone when it was muted.
I don't know how accurate this is, but this is what I found on how the Echo works:
"
Echo is built on Texas Instruments DM3725 Digital Media Processor.
This TI SoC has two key pieces inside: the first is an ARM Cortex-A8 MPU, and the second is a TMS320C64x+ DSP. The ARM core should be running Linux and the DSP is running firmware.
When idling, the ARM core is taken to lowest possible power state and Linux is completely suspended. At this time the DSP and 64KB On-Chip RAM are active. The DSP firmware processes noise coming in from the mics and attempts to identify if a keyword (e.g., Alexa) is spoken. As soon as it identifies there's a keyword, DSP sends an interrupt to wake up the ARM core which in turn resumes Linux.