

by Cacti 890 days ago

Problems reducible even partially to matrix math are for many practical purposes embarrassingly parallel, even within a single core. A couple hundred million FLOPS with 1990s SIMD support will let you run nearly all near-SOTA models within, idk, 3s, with most running in 0.1 or 0.01s. That’s pretty fast considering it’s an EP32 and some of these capabilities/models didn’t even exist a year ago.
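To make the arithmetic concrete, here is a rough back-of-envelope sketch. The 200 MFLOPS figure is the "couple hundred million" from above; the model sizes are hypothetical round numbers picked for illustration, not measurements, and memory stalls and overhead are ignored.

```python
def inference_seconds(macs: int, flops_per_second: float) -> float:
    """Estimate one forward pass; each multiply-accumulate counts as 2 FLOPs."""
    return 2 * macs / flops_per_second


THROUGHPUT = 200e6  # assumed sustained throughput in FLOPS

# Hypothetical model sizes in multiply-accumulates per inference.
MODELS = [("tiny", 1_000_000), ("small", 10_000_000), ("large", 300_000_000)]

for name, macs in MODELS:
    print(f"{name}: {inference_seconds(macs, THROUGHPUT):.2f} s")
```

Under those assumptions the three cases land at roughly 0.01 s, 0.1 s and 3 s.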

Your expectation was not really wrong, because for most purposes, when discussing a “model” one is really talking about “capabilities”. And capabilities often require many calls to the model. And that capability may be reliant on being refreshed very rapidly… and now your 0.1s is not even slow, it’s almost existentially slow.

Re: training. Even on the EP32, training is entirely doable, so long as you pretend you are in 2011 solving 2011 problems hahaha

1 comment

cooootce 890 days ago

Most MCUs have no FPU, so all floating-point computation is emulated in software, which makes it really slow. But yes, simple integer SIMD improves performance a lot!
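A minimal sketch of why integers help, assuming symmetric int8 quantization (my choice for illustration, not something specified here): the inner loop uses only integer multiply-accumulates, and the scale is applied once at the end. The helper names and scale values are made up, and Python only stands in for what would be C or SIMD intrinsics on a real MCU.

```python
def quantize(values, scale):
    """Map floats to int8 with a symmetric scale, so real value ~= q * scale."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def int8_dot(qa, qb):
    """Integer-only dot product; on an MCU the accumulator would be an int32."""
    acc = 0
    for x, y in zip(qa, qb):
        acc += x * y
    return acc


activations = [0.5, -1.0, 0.25]
weights = [1.0, 0.5, -2.0]
a_scale, w_scale = 1 / 64, 1 / 32  # hypothetical scales

qa = quantize(activations, a_scale)
qw = quantize(weights, w_scale)
acc = int8_dot(qa, qw)            # all the heavy work is integer math
result = acc * a_scale * w_scale  # single rescale back to a real value
print(qa, qw, acc, result)
```

In practice the final rescale is usually done with a fixed-point multiplier as well, so no float emulation is needed at all.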

The main limitation is often not processing time but available RAM: some model architectures need to keep multiple layers, or very large layers, in RAM, and you hit the hard RAM limit pretty quickly.
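A rough way to check this before deploying, sketched under simplifying assumptions: a plain sequential model where each layer needs only its input and output buffers alive at once, with weights kept in flash. The layer shapes and the 64 KiB budget are hypothetical; skip connections or branches would keep more tensors alive and raise the peak.

```python
def peak_activation_bytes(tensor_sizes, bytes_per_element=1):
    """tensor_sizes[i] is the element count entering layer i; the last entry
    is the final output. Peak is the largest input+output pair."""
    pairs = zip(tensor_sizes, tensor_sizes[1:])
    return max(a + b for a, b in pairs) * bytes_per_element


# Hypothetical small CNN: 96x96 grayscale input, three conv stages, 10 outputs.
sizes = [96 * 96 * 1, 48 * 48 * 8, 24 * 24 * 16, 12 * 12 * 32, 10]

RAM_BUDGET = 64 * 1024  # assumed 64 KiB available for activations

for label, width in [("int8", 1), ("float32", 4)]:
    peak = peak_activation_bytes(sizes, width)
    print(f"{label}: {peak} bytes, fits={peak <= RAM_BUDGET}")
```

With these made-up shapes the int8 version fits in the budget while the float32 version does not.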

As for training on an MCU, it's possible, but only for simple needs and with special model architectures; again, RAM is the limit.