by kkielhofner 968 days ago
This is a common viewpoint.

Have you used Echo/Alexa and seen what people do with it?

"Alexa make an entry on my calendar for lunch with Guillermo, Brian, and Kyle next week Wednesday at noon at Giordano's on Ohio street in Chicago". From 10-15 feet away, often with all kinds of noise, echo, who knows what. A child mumbling French can get within range of an Echo device and do this (with varying degrees of success).

Yes, a lot of that is handled on device in the audio frontend and elsewhere, but it often still bleeds through and makes the fundamental speech recognition challenging. Not to mention that every user brings their own accent, voice, and speech pattern.

That's firmly Whisper territory, and it doesn't even get into the flexible grammar, integrations, etc. with entire other stacks.

Plus, many hundreds of millions of dollars and nearly a decade later Alexa still struggles with this.


Good response.

However, wouldn't your described use case be an activity that occurs after wake word activation? Then you hand off the rest of the audio stream to Whisper for transcription?

Thanks!

Yes, that's exactly what we do[0] (just like the commercial stuff).

Wake word and VAD are low-resource, and even an ESP chip can handle both plus streaming. The ESP-BOX-3 is actually our main target device for the voice hardware interface. It's the nearly infinite variability and complexity of audio, speech, grammar, language, etc. where you need the "big guns".
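
To make the split concrete, here is a minimal sketch of the cheap "gate" stage that could run on the device before anything is streamed to the server. It is not Willow's actual code: the energy threshold is a crude stand-in for a real VAD model, and the names (frame_rms, gate_speech) and the threshold/hangover values are all hypothetical.

```python
import math
from typing import Iterable, Iterator, List


def frame_rms(frame: List[int]) -> float:
    """Root-mean-square energy of one frame of 16-bit PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def gate_speech(
    frames: Iterable[List[int]],
    threshold: float = 500.0,  # assumed value, would need tuning per microphone
    hangover: int = 2,         # trailing quiet frames to keep after speech
) -> Iterator[List[int]]:
    """Yield only the frames worth streaming to the big model.

    A frame passes if its energy is above the threshold, or if it is one
    of the first `hangover` quiet frames right after a loud one (so word
    endings are not clipped).
    """
    silent_run = hangover  # start with the gate closed
    for frame in frames:
        if frame_rms(frame) >= threshold:
            silent_run = 0
            yield frame
        elif silent_run < hangover:
            silent_run += 1
            yield frame
```

Everything the gate lets through would then be streamed off-device for the expensive part (transcription), which is where the real compute goes.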

Another thing that seems to be getting lost on people - user expectations for voice interfaces are pretty high. If wake fails, a transcript is wrong, speech rec is slow, etc., it's easier, faster, and far less frustrating to just take your phone out of your pocket. At that point why even have something poorly attempting to do voice?

[0] - https://heywillow.io/how-willow-works/#willow-inference-serv...

I'm glad I'm not crazy :D

Do you see an eventual future where some notional "model-on-chip" would hard-wire something like Whisper into a dedicated, integrated, low-power chip for these more demanding uses?

I get asked that a lot.

It’s certainly possible. However, consider the market dynamics.

Look at the Coral accelerator from Google. It’s $60. It has 4 TOPS.

Sounds great, until you dig just a little bit deeper.

It has 6-8 MB of memory. A speech recognition model of sufficient quality for these tasks is measured in hundreds of megabytes. Non-starter.
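
A quick back-of-the-envelope check of that gap, as a sketch. The parameter counts are the approximate figures published for the Whisper model sizes, the 8 MB budget is the upper end of the range above, and the math only counts weights (activations and runtime overhead would make it worse). The helper names are made up for illustration.

```python
# Approximate Whisper parameter counts, in millions.
WHISPER_PARAMS_MILLIONS = {
    "tiny": 39,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large": 1550,
}

ON_CHIP_MB = 8  # upper end of the 6-8 MB figure


def weights_mb(params_millions: int, bytes_per_param: int) -> float:
    """Size of the weights alone, in (decimal) megabytes."""
    return params_millions * 1_000_000 * bytes_per_param / 1_000_000


def fits(model: str, bytes_per_param: int = 1, budget_mb: float = ON_CHIP_MB) -> bool:
    """True if the model's weights fit in the memory budget."""
    return weights_mb(WHISPER_PARAMS_MILLIONS[model], bytes_per_param) <= budget_mb


if __name__ == "__main__":
    for name, millions in WHISPER_PARAMS_MILLIONS.items():
        int8 = weights_mb(millions, 1)  # 1 byte per weight (int8)
        fp16 = weights_mb(millions, 2)  # 2 bytes per weight (fp16)
        print(f"{name:>6}: int8 ~{int8:.0f} MB, fp16 ~{fp16:.0f} MB, fits: {fits(name)}")
```

Even the smallest size quantized to one byte per weight is several times over the budget, and the sizes people actually want for this kind of speech are in the hundreds of megabytes.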

Even with the might of Google behind it, the price point, performance, memory, and therefore utility are quite limited for all but a few bespoke applications. Google also has a lot of experience with their TPUs, from phones to datacenters, so they reduced costs and benefited from shortcuts via that experience and scale.

Yet the capabilities and software ecosystem are pathetic, with even the official Python implementation not having a single commit for 18 months, being stuck on Python < 3.10.

A random $100 used Nvidia card has 8GB of VRAM, 6 TFLOPS, and over 200GB/s of memory bandwidth. CUDA is also hands down the most well supported software ecosystem. There isn’t anything in ML that doesn’t have tier 1 support for CUDA, and vice-versa. Even this ancient card fully supports CUDA 12, so it’s future-proof well into a decade past its release date.

If Google can’t pull off something targeting this market with reasonable availability, price points, and software support, a new entrant in the field doesn’t stand a chance.

If someone tried to manufacture such a device, then between the low manufacturing/sales volume, the additional memory, and the software ecosystem, it would likely come in at multiples of the cost of a used Nvidia GPU, and even then it couldn’t remotely compete on software.

GPUs catch a lot of flak on power usage but here’s the thing: my GTX 1070 idles at 10 watts with all models loaded. It can do Frigate, transcoding with Plex/Jellyfin, and Willow voice sessions in its sleep and still have 80% of the VRAM free for whatever else I want to throw on it down the line.
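
For a sense of scale, the idle numbers work out like this. A rough sketch only: it assumes the card sits at the 10 watt idle figure all year, and the function names are just for illustration.

```python
HOURS_PER_YEAR = 24 * 365


def annual_kwh(watts: float) -> float:
    """Energy used in a year at a constant draw, in kilowatt-hours."""
    return watts * HOURS_PER_YEAR / 1000


def vram_used_gb(total_gb: float, free_fraction: float) -> float:
    """VRAM in use given the total and the fraction still free."""
    return total_gb * (1 - free_fraction)


if __name__ == "__main__":
    # 10 W idle, 8 GB card with 80% of VRAM free.
    print(f"idle energy: ~{annual_kwh(10):.1f} kWh/year")
    print(f"VRAM in use: ~{vram_used_gb(8, 0.8):.1f} GB of 8 GB")
```

So the always-on cost is under 90 kWh a year, and all of those workloads together sit in roughly 1.6 GB of the card's 8 GB.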

It’s very difficult to compete with. Not impossible, but a very special set of things would have to come together to stand a chance.

The only thing I can possibly think of is a Raspberry Pi variant with an NPU and unified memory, but even that ecosystem would have a lot of work ahead of it to match what Nvidia (a $1T company) has built over 15 years with CUDA.