Badass. I've spent the last week digging into ARM optimization for these models because it's really fascinating how close we are to local deployment for this stuff - writeups like these should help spread awareness.
Thanks for writing!
In academia we're getting the next step operational: training on Android. Any advice on what we should watch out for?
Obviously you need a bit of patience and lots of volunteer devices. With unsupervised continuous learning this is solved, at least in emulation. See "G-Rank: Unsupervised Continuous Learn-to-Rank for Edge Devices in a P2P Network" [1]. Optimal learning rate is left as an exercise for the developer.
(Disclaimer: this is our own work; I run a lab on "systems for peer-to-peer machine learning".)