Do you see an eventual future where some notional "model-on-chip" would hard-wire something like Whisper into a dedicated, integrated, low-power chip for these more demanding uses?
It’s certainly possible. However, consider the market dynamics.
Look at the Coral accelerator from Google. It’s $60. It’s rated at 4 TOPS.
Sounds great, until you dig just a little bit deeper.
It has 6-8 MB of memory. A speech recognition model of sufficient quality for these tasks is measured in hundreds of megabytes. Non-starter.
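To put rough numbers on that gap, here is a minimal back-of-the-envelope sketch. It assumes the approximate parameter counts listed in the openai/whisper README, counts weights only (no activations or runtime overhead), and takes the upper 8 MB figure as the on-chip budget. The names `WHISPER_PARAMS`, `weight_bytes`, and `fits_on_chip` are made up for illustration.

```python
# Approximate parameter counts as published in the openai/whisper README.
WHISPER_PARAMS = {
    "tiny": 39e6,
    "base": 74e6,
    "small": 244e6,
    "medium": 769e6,
    "large": 1550e6,
}

# Bytes needed to store one weight at a given precision.
BYTES_PER_WEIGHT = {"int8": 1, "fp16": 2}

# Assumption: the upper end of the 6-8 MB on-chip memory figure.
ON_CHIP_BYTES = 8 * 1024**2


def weight_bytes(params: float, dtype: str) -> int:
    """Storage for the weights alone, ignoring activations and overhead."""
    return int(params * BYTES_PER_WEIGHT[dtype])


def fits_on_chip(params: float, dtype: str, budget: int = ON_CHIP_BYTES) -> bool:
    """True if the weights alone fit inside the on-chip memory budget."""
    return weight_bytes(params, dtype) <= budget


if __name__ == "__main__":
    for name, params in WHISPER_PARAMS.items():
        for dtype in BYTES_PER_WEIGHT:
            size_mb = weight_bytes(params, dtype) / 1e6
            verdict = "fits" if fits_on_chip(params, dtype) else "does not fit"
            print(f"{name:>6} {dtype}: {size_mb:8.0f} MB -> {verdict}")
```

Under these assumptions, even the smallest model quantized to int8 is several times larger than the on-chip budget, and the sizes that are usually considered good enough for voice assistants land in the hundreds of megabytes.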
Even with the might of Google behind it, the price point, performance, memory, and therefore utility are quite limited for all but a few bespoke applications. Google also has a lot of experience with their TPUs, from phones to datacenters, so they reduced costs and benefited from shortcuts via that experience and scale.
Yet the capabilities and software ecosystem are pathetic: even the official Python implementation hasn’t had a single commit in 18 months and is stuck on Python < 3.10.
A random $100 used Nvidia card has 8 GB of VRAM, 6 TFLOPS, and over 200 GB/s of memory bandwidth. CUDA is also hands down the most well-supported software ecosystem. There isn’t anything in ML that doesn’t have tier 1 support for CUDA, and vice versa. Even this ancient card fully supports CUDA 12, so it’s future-proof well into a decade past its release date.
If Google can’t pull off something targeting this market with reasonable availability, price points, and software support, a new entrant in the field doesn’t stand a chance.
If someone tried to manufacture such a device, then between the low manufacturing/sales volume, the additional memory, and the software ecosystem it would likely come in at multiples of the cost of a used Nvidia GPU, and even then it couldn’t remotely compete on software.
GPUs catch a lot of flak on power usage, but here’s the thing: my GTX 1070 idles at 10 watts with all models loaded. It can do Frigate, transcoding with Plex/Jellyfin, and Willow voice sessions in its sleep and still have 80% of the VRAM free for whatever else I want to throw on it down the line.
It’s very difficult to compete with. Not impossible, but a very special set of things would have to come together to stand a chance.
The only thing I can possibly think of is a Raspberry Pi variant with an NPU and unified memory, but even that ecosystem would have a lot of work ahead of it to match what Nvidia (a $1T company) has built over 15 years with CUDA.