| I still think the defining moment for ML inference (and maybe even training!) on embedded devices will come when there are viable special-purpose, low-power ML chips. As much as I hate to do this, I'm going to make a comparison to Bitcoin mining. Mining is all about optimizing hashes/joule to get the best ROI. We watched it go from CPU -> GPU -> FPGA -> ASIC in the quest for efficiency. In some ways, we're seeing the same thing in ML model training and inference: CPU -> GPU -> TPU. We're even seeing some special-purpose coprocessors deployed, as in the iPhone X. (https://www.wired.com/story/apples-neural-engine-infuses-the...) But I think the final leap will come by going from digital execution to application-specific analog computing. If you don't need high precision, you can compute extremely quickly and efficiently using properly-configured analog circuits. IBM is working on this kind of system with their TrueNorth line (https://techcrunch.com/2017/06/23/truenorth/). It hasn't been proven yet, but I think there is huge potential. |
For inference, GPUs are also pretty damn efficient since it's an embarrassingly parallel task w/ minimal synchronization (no gradient updates needed). In this case, FPGAs are a far better choice since you can push updates to accommodate new network architectures, activation functions, etc. The TPU instead relies on a matrix-multiplier unit which supports more use cases but won't be as performant on something like an RNN.