Problems reducible even partially to matrix math are for many practical purposes embarrassing parallel even within a single core. A couple hundred million FLOPS with 1990s SIMD support will let you run nearly all near-SOTA models within, idk, 3s, with most running in 0.1 or 0.01s. That’s pretty fast considering it’s an EP32 and some of these capabilities/models didn’t even exist a year ago.
Your expectation was not really wrong, because for most purposes, when discussing a “model” one is really talking about “capabilities”. And capabilities often require many calls to the model. And that capability may be reliant on being refreshed very rapidly… and now your 0.1s is not even slow, it’s almost existentially slow.
Re: training. even on the EP32, training is entirely doable, so long as you pretend you are in 2011 solving 2011 problems hahaha
In most MCU there is not an FPU so all floating point compute is emulated with software, so it's really slow. But yes, simple SIMD on integer improve so much the performance !
The main limitation is often not the time to process but the RAM available, some architecture of model need to keep multiple layers in ram or very big layers, and you hit the hard limit of RAM pretty quickly.
Concerning the training on MCU, it's possible but with simple need and special architecture of model, again the RAM is the limit.
I ended up hand rolling a custom micropython module for the S3 to do a proof of concept handwriting detection demo on an ESP32, might be interesting to some.
Great post with very interesting detail, thanks !
Another optimization could be to quantize the model, this transform all compute as int compute and not as floating point compute. You can lose some accuracy, but for any bigger model it's a requirement !
Espressif do a great job on the TinyML part, they have different library for different level of abstraction. You can check https://github.com/espressif/esp-nn that implement all low level layers. It's really optimized and if you use the esp32-s3 it will unlock a lot of performance by using the vector instructions.
You are right I should definitely be looking into how to run these models as ints as well, especially with the C optimizations to micropython you would see a lot larger performance gains using ints compared to floats. Definitely need to find some time to try it!
On the other hand the tinyML library looks great too and if I was going to do this for a product that would likely be the direction I would end up taking just cause it would be more extensible and better supported.
Your expectation was not really wrong, because for most purposes, when discussing a “model” one is really talking about “capabilities”. And capabilities often require many calls to the model. And that capability may be reliant on being refreshed very rapidly… and now your 0.1s is not even slow, it’s almost existentially slow.
Re: training. even on the EP32, training is entirely doable, so long as you pretend you are in 2011 solving 2011 problems hahaha