by gr4yb34rd 1193 days ago
I doubt the guts of something like an Echo would be capable of running any kind of speech-to-text model. It'll probably be a lot more of a concern in a few years when speech models like that are getting baked into cheap chips for consumer devices, though.

I'm not as doubtful: iPhones have had on-device speech recognition since at least 2019, with iOS 13[1]. Amazon might have legal or policy reasons for not doing offline speech recognition, but the technology has been there for a while.

[1]: https://developer.apple.com/documentation/speech/sfspeechrec...

Has the tech been available in a $29 device that also includes a speaker and a microphone array?
As others have noted, these devices are likely loss leaders.

That said, pretty much ($35 with shipping qty 1):

ESP32-Korvo Development Board https://a.co/d/fmaQfuL

This is a dev board that is many years old, with Wi-Fi, Bluetooth, a multi-mic ring array, RGB LEDs, speaker amplifier, SD slot, etc. Even this old version supports wake word with ring buffer, etc. That said, this is also a (fun!) dev board - not a finished product. Newer versions based on the S3 are much more powerful.

Teardowns of Echo devices show much more capable hardware, but at Amazon’s scale, with their engineering, supply chain, and manufacturing resources, they’re probably not losing all that much on each device. The Echo unit at Amazon has been posting serious losses, though I’ve never seen a breakdown of where those losses come from.

$29 is the sale price, not the BOM price. The entire point of "smart" home devices is to reduce consumer friction around revenue-driving actions; it would be surprising if Amazon were to prioritize making a profit on a $29 device rather than the sales and subscriptions it facilitates.
I don't know if it's every smart home vendor's goal to drive revenue. Apple sells their smart speakers for $100 or $300, which doesn't seem like a "loss leader" price, and it doesn't ask you to buy anything else. Their marketing page mentions what it can integrate with (smart locks, smart lights, etc.) but those things are notably not subscriptions. If you buy a smart light switch and a smart speaker, then you can just say "Hey Siri turn off the lights in the kitchen" and nobody ever gets any more money, and you didn't have to go downstairs and turn off the lights in the kitchen.

This seems like a very justified smart home. I am skeptical of all things proprietary technology, but Apple's stuff bothers me the least.

You can run whisper.cpp (a very solid speech-to-text engine) on a Raspberry Pi (https://github.com/ggerganov/whisper.cpp/discussions/166), so I'm pretty sure with a modest CPU upgrade an Echo would be more than capable of running it locally.
I’m a big fan of Whisper and whisper.cpp but doing reliable wake word detection with any kind of reasonable latency on a Raspberry Pi is likely to be a poor fit and very bad experience.

The Whisper model operates on 30 sec speech chunks. Input audio has to be padded to that length. So you’re constantly going to be recording audio, padding, looking for wake word, and then activating full recording upon detection. All on padded 30 sec chunks looking back…
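
To make that loop concrete, here is a rough sketch of the buffering side only, with no model call. It assumes 16 kHz mono input (the rate Whisper works with); `fit_to_chunk` and `RingBuffer` are hypothetical helpers written for this illustration, not part of Whisper or whisper.cpp.

```python
from collections import deque

SAMPLE_RATE = 16_000                          # Whisper works on 16 kHz mono audio
CHUNK_SECONDS = 30                            # fixed window length the model expects
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS   # 480,000 samples per window


def fit_to_chunk(samples, length=CHUNK_SAMPLES):
    """Return exactly `length` samples (hypothetical helper).

    Longer input keeps only the most recent audio; shorter input is
    zero-padded (silence) at the end.
    """
    samples = list(samples)
    if len(samples) >= length:
        return samples[-length:]
    return samples + [0.0] * (length - len(samples))


class RingBuffer:
    """Holds only the most recent `seconds` of audio (hypothetical helper)."""

    def __init__(self, seconds=2.0, sample_rate=SAMPLE_RATE):
        # deque with maxlen silently drops the oldest samples when full
        self._buf = deque(maxlen=int(seconds * sample_rate))

    def push(self, frame):
        """Append a frame of samples from the microphone."""
        self._buf.extend(frame)

    def window(self):
        """Return the buffered audio padded out to one full 30 s chunk."""
        return fit_to_chunk(self._buf)
```

Even with a 2 second ring buffer, every wake word check still hands the model a full 480,000-sample window, most of it silence.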

Then there is model size and availability. Whisper base or maybe even tiny could potentially give decent results for wake word detection but I’m skeptical. Wake words can be surprisingly tricky.

That’s just for the wake word, assuming you’re going to stream audio afterwards, since reliably doing ASR and NLP to figure out speaker intent is far too challenging and time-consuming for Raspberry Pi class hardware to deliver response times anyone would consider acceptable. Whisper does pretty well with relatively noisy, low-quality speech, and far-field microphones are amazing, but I doubt that's enough to provide anything approaching Echo/Alexa quality in the real world.

This is a carefully scripted demo[0] showing it takes a whopping 15 seconds to wake, ASR the speech, and return the result. The average person could easily take their phone out of their pocket, unlock it, look for the weather app, and read the weather in less time.

This demo[1] claims "real time" but looking at the example videos it clearly isn't and the accuracy leaves a lot to be desired. This is with three threads on a Raspberry Pi 4.
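
For what it's worth, "real time" is usually judged by the real-time factor: processing time divided by the duration of the audio processed. A minimal sketch follows; the function names are hypothetical and the numbers in the usage note are made up for illustration, not measurements from either demo.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration; below 1.0 means faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def keeps_up(processing_seconds, audio_seconds):
    """True if transcription finishes no later than the audio itself takes to play."""
    return real_time_factor(processing_seconds, audio_seconds) <= 1.0
```

As an illustration only: spending 15 seconds on a 5 second utterance gives a factor of 3.0, which is well short of real time.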

I just tried asking Echo what the weather is like and it was so fast I had trouble timing it - somewhere around one second.

[0] - https://youtube.com/watch?v=Aor6CFkcWzU&si=EnSIkaIECMiOmarE

[1] - https://github.com/ggerganov/whisper.cpp/discussions/166

Recently I made some progress on efficiently detecting short voice commands (wake words) on RPi4 [0]. Check out the "command" example in whisper.cpp and its "Guided mode" operation. There are additional improvements on the way too.

[0] https://twitter.com/ggerganov/status/1602759833312456704

As I said, I really appreciate and respect all of the work you’re doing on whisper.cpp, but when it comes to things like wake words and commands I have to think Whisper is just fundamentally the wrong tool for the job. Tiny is a 39M-parameter model with fairly poor accuracy and high latency (without GPU) that just about maxes out a Raspberry Pi - all for a few very carefully pronounced words under ideal conditions (in this case).

That said, there isn’t much in the open source space (that I’ve found) that’s even remotely competitive with Alexa/Echo so I’m all for any efforts and attention in this area. Perfect is the enemy of good but this thread started off with people wondering if Whisper or anything based on it was close to Alexa/Echo for wake word activated assistant tasks. I think it’s very safe to say it isn’t.

Again, I really appreciate your work on making Whisper more accessible to the masses for local ASR - please don’t take this as criticism for your efforts. If anything I’ve been involved in open source projects and it’s frustrating when people try to jam a square peg in a round hole, only to come back and complain your labor of love didn’t work for them.

True, but note that they're using the tiny model. In my experimentation, you need at least the small model to get transcription I'd call "good", which is still a bit slower than you'd like on a moderately fast laptop from 2019.

That said, whisper is incredible and the era of very good local speech-to-text on moderate hardware is basically here, or will be in the next year.

Early versions of Dragon NaturallySpeaking ran on Pentiums.
Dragon NaturallySpeaking required extensive training on the voice of the user, and the result wasn't as accurate as modern machine-learning speech-to-text models. You had to watch it and actively correct its results.

IMO, modern speech-to-text models are still nowhere near accurate enough. Maybe they should bring back the per-user training.

I remember seeing this at a trade fair as a child with my mother in 1993.
Not gen 1 devices, but in 2023 some of the Echos do partial on-device processing of commands (same with HomePods and Google Home).