Right. But ... this would limit you to either extremely small models or extremely large FPGAs, yes? If there's a simple machine learning task that requires sub-microsecond latency I can see the point, but otherwise?
Yes, definitely: this type of work is applicable in domains where software running on general-purpose processors cannot meet latency or power requirements.
Yes, but simple models are far more expressive than people give them credit for.
As one example, I've shoved <100 parameter networks into driver code before and hand-tuned them to run in 10-20 nanoseconds. E.g., touchpad hardware tends to suck, especially as it ages, sometimes generating thousands of phantom events per second and causing drift and other such issues. Typically that's solved via careful tuning of hysteresis and other parameters, but the problem is actually very amenable to neural nets. It's easy to collect good-enough data en masse, and you can tune precision vs. recall to bias heavily toward dropping more events without any issues, which gets you a 100% reduction of the phantom events. (Dropping more events has the effect of slightly slowing down the mouse pointer, which you can compensate for at the OS level where you adjust pointer speed.)
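To make the precision-vs-recall knob concrete, here is a minimal sketch of a sub-100-parameter event filter. Everything specific in it is an assumption for illustration: the four-feature event layout, the 4-8-1 network shape, the random (untrained) weights, and the threshold values are all hypothetical, and real driver code would more likely be C with fixed-point math than Python.

```python
import math
import random

# Hypothetical per-event features: (dx, dy, pressure, dt), each scaled to [-1, 1].
N_IN, N_HIDDEN = 4, 8

def init_params(seed=0):
    """Random placeholder weights; a real filter would load trained values."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1, 1) for _ in range(N_IN)] for _ in range(N_HIDDEN)]
    b1 = [0.0] * N_HIDDEN
    w2 = [rng.uniform(-1, 1) for _ in range(N_HIDDEN)]
    b2 = 0.0
    return w1, b1, w2, b2

def param_count(params):
    """Total number of scalar weights and biases in the network."""
    w1, b1, w2, _b2 = params
    return sum(len(row) for row in w1) + len(b1) + len(w2) + 1

def score(params, x):
    """Sigmoid output of a one-hidden-layer ReLU network: 'is this event real?'"""
    w1, b1, w2, b2 = params
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]
    z = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-z))

def keep_event(params, x, threshold=0.9):
    """A higher threshold drops more events: fewer phantoms, slower pointer."""
    return score(params, x) >= threshold
```

With this shape the network has 4*8 + 8 + 8 + 1 = 49 parameters. Raising `threshold` can only shrink the set of events that pass, which is the "bias heavily toward dropping" trade-off described above.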
Lots of image recognition tasks (like spotting undesirable products in industrial settings), image modification tasks (I have some models locally to process hand-drawn images and unwarp them, remove notebook paper lines, etc.), audio modification tasks (part of my editing pipeline includes hand-editing audio to achieve some effect, doing that a few times, and training models to copy that edit), and all sorts of other things are similarly doable in much smaller models than you might think -- not as small as that driver code, but still small enough to fit in hobbyist FPGAs.
Not all of those require low latency or high throughput, but audio processing is expensive, so high throughput is nice; industrial applications often operate on fast streams of many products, so both throughput and latency are important; and more generally when you have fast models available (or any fast code really) you'll tend toward different thought patterns and creative ideas which you wouldn't have even considered otherwise and which wouldn't be possible without those faster solutions.
Now that I think about it, we average 1.5M inferences per second at $WORK, expected to scale up 10-30x this year, and we have a moderately tight latency budget. This solution wouldn't fit without a larger, more expensive FPGA, at least not unless KANs are comparatively that much more expressive than our current solution (based on past experimentation, my hunch is that they're not, but you never know), but it's borderline useful.
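For a rough sense of scale, the throughput numbers above translate into an amortized time per inference as sketched below. The `parallel_lanes` parameter is a hypothetical knob for independent pipelines, not something from the original setup, and this figure is separate from the latency budget: a pipelined design can accept a new input every N ns while each individual result still takes longer to come out.

```python
def per_inference_budget_ns(inferences_per_second, parallel_lanes=1):
    """Amortized nanoseconds available per inference on each lane."""
    return parallel_lanes * 1e9 / inferences_per_second

current_rate = 1.5e6  # inferences per second, from the comment above

for scale in (1, 10, 30):
    rate = current_rate * scale
    budget = per_inference_budget_ns(rate)
    print(f"{scale:>2}x: {rate / 1e6:.1f}M inf/s, {budget:.1f} ns each on one lane")
```

So a single lane has roughly 667 ns per inference today, about 67 ns at 10x, and about 22 ns at 30x.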
Some very cool applications of small models! It seems that this scale of models tends to be sufficient when doing simpler classification, anomaly detection, signal processing, etc. as compared to generative modeling (where larger models are usually necessary).
Yep, as a rule of thumb generative models need to be much larger. As a small caveat, that's because of what we're doing with those models; generation itself can also be tiny and fast, but only when the output space is sufficiently constrained. Next-word prediction (in keyboards), speech codecs (TTS, especially for blind people), and a number of other scenarios both admit small models and fall into the domain of what most experts would call "generative."
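As a toy illustration of generation with a constrained output space, here is a bigram-count next-word suggester. It is a deliberately simple stand-in, not how any particular keyboard works: the function names and the tiny corpus are made up for the example, and production keyboards use more capable (though still small) models.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count which word follows which in a whitespace-tokenized corpus."""
    words = text.lower().split()
    table = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        table[prev][nxt] += 1
    return table

def suggest(table, prev_word, k=3):
    """Return up to k most frequent followers of prev_word (empty if unseen)."""
    followers = table.get(prev_word.lower(), Counter())
    return [word for word, _count in followers.most_common(k)]

# Hypothetical corpus, just to exercise the functions.
table = train_bigrams("the cat sat on the mat the cat ran")
print(suggest(table, "the", k=2))
```

The whole "model" is a lookup table over the words it has seen, which is why it can be tiny and fast: the output space is limited to a fixed vocabulary and a handful of candidates.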
One primary application of this work is in high-energy physics (https://home.cern/smarter-decisions-at-the-speed-of-collisio...). Ultrafast and real-time learning is also very applicable for problems in quantum computing, plasma control, etc. (https://arxiv.org/pdf/2602.02005).