by cjdell 968 days ago
I wonder if this is fast enough for wakeword detection in WASM. Picovoice worked extremely well for this, but it's proprietary.

There's also OpenWakeWord[0]. The models are readily available in tflite and ONNX formats and are impressively "light" in terms of compute requirements and performance.

It should be possible.

[0] - https://github.com/dscripka/openWakeWord

I would think that using any version of whisper for this use-case would be like digging a posthole in your front yard with an orbital directed energy cannon powered by a fusion reactor.

This is a common viewpoint.

Have you used Echo/Alexa and seen what people do with it?

"Alexa make an entry on my calendar for lunch with Guillermo, Brian, and Kyle next week Wednesday at noon at Giordano's on Ohio street in Chicago". From 10-15 feet away, often with all kinds of noise, echo, who knows what. A child mumbling French can get within range of an Echo device and do this (with varying degrees of success).

Yes, a lot of that is handled on device in the audio frontend and elsewhere, but it often still bleeds through and makes the fundamental speech recognition challenging. Not to mention the "bring your own" accent/voice/speech pattern.

That's firmly Whisper territory and doesn't even get into the flexible grammar, integrations, etc with entire other stacks.

Plus, many hundreds of millions of dollars and nearly a decade later Alexa still struggles with this.

Good response.

However, wouldn't your described use-case be an activity that occurs after wakeword activation? Then hand off the rest of the audio stream to Whisper for transcription?

Thanks!

Yes, that's exactly what we do[0] (just like the commercial stuff).

Wake word and VAD are low-resource and even an ESP chip can handle that + stream. The ESP-BOX-3 is actually our main target device for voice hardware interface. It's the nearly infinite audio, speech, grammar, language, etc variability and complexity where you need the "big guns".

Another thing that seems to be getting lost on people - user expectations for voice interfaces are pretty high. If wake fails, a transcript is wrong, speech rec is slow, etc., it's easier, faster, and far less frustrating to just take your phone out of your pocket. At that point, why even have something poorly attempting to do voice?

[0] - https://heywillow.io/how-willow-works/#willow-inference-serv...
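The wake word, then VAD, then hand-off flow described above can be sketched roughly like this. It is only an illustration, not Willow's actual code: `is_wake_word` and `is_speech` are hypothetical stand-ins for whatever on-device detectors you use, and a real device would stream frames to the server rather than buffer them.

```python
from typing import Callable, Iterable, List

Frame = bytes  # one fixed-size chunk of audio (format is an assumption)


def capture_utterances(
    frames: Iterable[Frame],
    is_wake_word: Callable[[Frame], bool],  # hypothetical on-device wake word detector
    is_speech: Callable[[Frame], bool],     # hypothetical on-device VAD
    max_silence_frames: int = 3,
) -> List[List[Frame]]:
    """Collect the audio that follows each wake word until the VAD reports
    `max_silence_frames` consecutive non-speech frames."""
    utterances: List[List[Frame]] = []
    current = None  # None means idle, i.e. still waiting for the wake word
    silence = 0

    def finish(buffered: List[Frame], trailing_silence: int) -> None:
        # Drop the trailing silence before handing the audio off.
        trimmed = buffered[: len(buffered) - trailing_silence]
        if trimmed:
            utterances.append(trimmed)

    for frame in frames:
        if current is None:
            # Cheap path: only the wake word detector runs while idle.
            if is_wake_word(frame):
                current, silence = [], 0
            continue
        current.append(frame)
        silence = 0 if is_speech(frame) else silence + 1
        if silence >= max_silence_frames:
            finish(current, silence)
            current = None

    if current is not None:
        finish(current, silence)  # stream ended mid-utterance
    return utterances


# Toy stream: labelled byte strings stand in for real audio frames.
stream = [b"noise", b"WAKE", b"set", b"a", b"timer", b"", b"", b"", b"noise"]
result = capture_utterances(
    stream,
    is_wake_word=lambda f: f == b"WAKE",
    is_speech=lambda f: f not in (b"", b"noise"),
)
print(result)
```

Each list in `result` is what would go to the "big guns" (Whisper or similar) for transcription; everything before the wake word never leaves the device.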

I'm glad I'm not crazy :D

Do you see an eventual future where some notional "model-on-chip" would hard-wire something like whisper into a dedicated integrated low-power chip for these more demanding uses?

I get asked that a lot.

It’s certainly possible. However, consider the market dynamics.

Look at the Coral accelerator from Google. It’s $60. It has 6m TOPS.

Sounds great, until you dig just a little bit deeper.

It has 6-8 MB of memory. A speech recognition model of sufficient quality for these tasks is measured in hundreds of megabytes. Non-starter.
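To put rough numbers on that gap, here is a back-of-the-envelope sketch. The parameter counts are the ones listed in the OpenAI Whisper README; the 8 MB figure is the upper end of the range above. It counts weights only (no activations or caches), so real requirements would be higher.

```python
# Parameter counts in millions, as listed in the OpenAI Whisper README.
WHISPER_PARAMS_M = {"tiny": 39, "base": 74, "small": 244, "medium": 769, "large": 1550}

ACCELERATOR_MEMORY_MB = 8  # upper end of the "6-8 MB" figure quoted above


def weights_mb(params_millions: float, bytes_per_param: int) -> float:
    """Approximate weight storage in decimal megabytes (weights only)."""
    return params_millions * bytes_per_param


for name, params in WHISPER_PARAMS_M.items():
    fp16 = weights_mb(params, 2)  # 16-bit floats
    int8 = weights_mb(params, 1)  # 8-bit quantized, the best case considered here
    print(
        f"{name:>6}: ~{fp16:,.0f} MB fp16, ~{int8:,.0f} MB int8 "
        f"({int8 / ACCELERATOR_MEMORY_MB:.0f}x the accelerator's memory at int8)"
    )
```

Even the smallest model quantized to 8 bits is several times larger than the on-chip memory, and the sizes that are actually usable for this are far beyond it.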

Even with the might of Google behind it, the price point, performance, memory, and therefore utility are quite limited for all but a few bespoke applications. Google also has a lot of experience with their TPUs, from phones to datacenters, so they reduced costs and benefited from shortcuts via that experience and scale.

Yet the capabilities and software ecosystem are pathetic, with even the official Python implementation not having a single commit for 18 months, being stuck on Python < 3.10.

A random $100 used Nvidia card has 8GB of VRAM, 6 TFLOPS, and over 200GB/s of memory bandwidth. CUDA is also hands down the most well supported software ecosystem. There isn’t anything in ML that doesn’t have tier 1 support for CUDA, and vice-versa. Even this ancient card fully supports CUDA 12, so it’s future-proof well into a decade past release date.

If Google can’t pull off something targeting this market with reasonable availability, price points, and software support, a new entrant in the field doesn’t stand a chance.

If someone tried to manufacture such a device, then between the low manufacturing/sales volume, additional memory, and software ecosystem, it would likely come in at multiples of the cost of a used Nvidia GPU, and even then it couldn’t remotely compete on software.

GPUs catch a lot of flak on power usage but here’s the thing: my GTX 1070 idles at 10 watts with all models loaded. It can do Frigate, transcoding with Plex/Jellyfin, and Willow voice sessions in its sleep and still have 80% of the VRAM free for whatever else I want to throw on it down the line.

It’s very difficult to compete with. Not impossible, but a very special set of things would have to come together to stand a chance.

The only thing I can possibly think of is a Raspberry Pi variant with an NPU and unified memory, but even that ecosystem would have a lot of work ahead of it to match what Nvidia (a $1T company) has built over 15 years with CUDA.

This reminds me of discussions about superfluous information in human language sentences. Consider the phrase "that man is bad" versus "that man bad". Somewhat crappy example, but basically yes, an idea can be conveyed in a more efficient representation, but what is lost through compaction is redundancy in a noisy environment.

If all you're doing is parsing "Alexa" out of the air... you're going to have a bad time because realistically, there is a contextual requirement. In AI applications, a proof-of-concept is great, but 99.9% accuracy is basically useless. Imagine if computer RAM were accurate 99.9% of the time... that's a broken tool.

If it takes 2 seconds to say "Alexa", that's 43,200 2-second chunks in a day, but if the listener is using a sliding window at 60 Hz, that's 5.2 million opportunities to screw up each day. 99.9% success of parsing a 2-second slice of audio is insufficient.
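The arithmetic above can be checked with a few lines. This is a simplified sketch: it assumes every window evaluation is an independent trial with the same 99.9% accuracy, which overlapping windows in a real detector would not be.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def windows_per_day(windows_per_second: float) -> float:
    """How many windows the detector evaluates in 24 hours."""
    return windows_per_second * SECONDS_PER_DAY


def expected_errors_per_day(windows_per_second: float, accuracy: float) -> float:
    """Expected misclassified windows per day, assuming independent trials."""
    return windows_per_day(windows_per_second) * (1.0 - accuracy)


ACCURACY = 0.999

# Non-overlapping 2-second chunks: one evaluation every 2 seconds.
chunk_windows = windows_per_day(0.5)
chunk_errors = expected_errors_per_day(0.5, ACCURACY)

# Sliding window re-evaluated 60 times per second.
sliding_windows = windows_per_day(60)
sliding_errors = expected_errors_per_day(60, ACCURACY)

print(f"2 s chunks:     {chunk_windows:,.0f} windows/day, ~{chunk_errors:,.0f} errors/day")
print(f"60 Hz sliding:  {sliding_windows:,.0f} windows/day, ~{sliding_errors:,.0f} errors/day")
```

Under those assumptions, 99.9% per-window accuracy works out to dozens of mistakes a day with 2-second chunks and thousands with a 60 Hz sliding window.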

At some point, no matter how much training you do for ONLY the word "Alexa", you're going to start getting diminishing returns, in which the model to reach desired accuracy will start getting bigger and bigger for less and less improvement. Logical context analysis can easily bridge the gap for much larger gains.

The model targets the decoder part of the system which is the speed bottleneck. So for tasks like classification it is not likely to be helpful. However a similar method could be used for that use case. (Coauthor)

It's probably still too big to be helpful with these model sizes, but if someone helpful runs the same training on `small.en` (and smaller) we might have something.

Yes, this is me praying to the benevolent HN gods that someone will pick this up and run with it. I don't have a GPU anywhere close to capable...

You'd be surprised how capable old GPUs are! I've had great success with people running Whisper-Turbo in the browser on really old hardware: https://whisper-turbo.com/

We have benchmarks[0] for Willow Inference Server using Whisper + ctranslate2 + some of our own optimizations.

TLD a six year old ~$100 used GTX 1070 is roughly 5x faster than a Threadripper PRO 5955WX at a fraction of the cost and power.

[0] - https://heywillow.io/components/willow-inference-server/#ben...

> TLD a six year old ~$100 used GTX 1070 is roughly 5x faster

Did you mean TIL?

It's not the inference, it's the training. They say in the paper: "We train with a batch size of 256 for a total of 80,000 optimisation steps, which amounts to eight epochs of training." That's a fair chunk of time. Mind you, `small.en` has smaller decoder layers than `medium.en`...
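For a sense of scale, the quoted numbers work out as below. The batch size, step count, and epoch count come from the quote; the 30-second clip length is my assumption (it is Whisper's input window, so it gives an upper bound on audio per epoch, not a figure from the paper).

```python
BATCH_SIZE = 256
OPTIMISATION_STEPS = 80_000
EPOCHS = 8

# Total training samples processed, and how many make up one epoch.
samples_seen = BATCH_SIZE * OPTIMISATION_STEPS
samples_per_epoch = samples_seen // EPOCHS

# Assumption: each sample is at most one 30-second Whisper input window.
ASSUMED_MAX_CLIP_SECONDS = 30
max_hours_per_epoch = samples_per_epoch * ASSUMED_MAX_CLIP_SECONDS / 3600

print(f"samples seen in total: {samples_seen:,}")
print(f"samples per epoch:     {samples_per_epoch:,}")
print(f"audio per epoch:       up to ~{max_hours_per_epoch:,.0f} hours (assuming 30 s clips)")
```

So whoever reruns this on `small.en` is looking at roughly 20 million sample passes, whatever the per-step cost turns out to be on their hardware.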