One nice thing about Lite is that it's a lot easier to just include the operations you need (compared to TensorFlow 'classic'), there's fusion for common patterns, and the base interpreter is only 70KB. That covers a lot of the advantages of using XLA for mobile apps. In return you have the ability to load models separately from the code, and the ops are hand-optimized for ARM.
I'm still a fan of XLA, and I expect the two will grow closer over time, but I think Lite is better for a lot of scenarios on mobile.
How about quantization? Does tensorflow lite perform quantization or is it tensorflow supposed to do it? Is it iterative process or straightforward? Or are you training quantized models as nn api docs say?
The quantization is done with a special training script that is quantization aware. We will be open sourcing a mobilenet quantized training script to show how to do this soon.
TensorFlow Lite is an interpreter in contrast with XLA which is a compiler. The advantage of TensorFlow lite is that a single interpreter can handle several models rather than needing specialized code for each model and each target platform. TensorFlow Lite’s core kernels have also been hand-optimized for common machine learning patterns. The advantage of compiler approaches is fusing many operations to reduce memory bandwidth (and thus speed). TensorFlow lite fuses many common patterns in the TensorFlow converter. We are of course excited about the possibility of using JIT techniques and using XLA technology within the TensorFlow Lite interpreter or as part of the TensorFlow Lite converter as a possible future direction.
I'm still a fan of XLA, and I expect the two will grow closer over time, but I think Lite is better for a lot of scenarios on mobile.