So, a way to think of this is: The speed (and therefore, cost) of training a machine learning model depends on (a) the ML techniques (how rapidly the model converges and to what accuracy); and (b) how quickly the processor executes the operations involved in the ML techniques.
The TPU is only an improvement in (b). It's not going to result in a big-O style speedup, because the same training algorithms and architectures will run on it that we run on CPUs & GPUs today.
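To make the (a)/(b) split concrete, here is a rough sketch of the cost model I have in mind. The function name and every number in it are made up for illustration; the point is only that faster hardware shrinks the time-per-step term by a constant factor, while the number of steps needed to converge is set by the ML technique.

```python
def training_cost(steps_to_converge, ms_per_step, dollars_per_hour):
    """Return (wall-clock hours, dollars) for one training run.

    steps_to_converge: set by the ML technique, factor (a)
    ms_per_step:       set by the hardware, factor (b)
    dollars_per_hour:  hypothetical rental price of the hardware
    """
    hours = steps_to_converge * ms_per_step / 3_600_000
    return hours, hours * dollars_per_hour


# Hypothetical baseline: 100k steps at 360 ms/step on a $10/hour machine.
base_hours, base_cost = training_cost(100_000, 360, 10.0)

# Hypothetical accelerator that is 4x faster per step at the same price.
# The step count is unchanged, so the gain is a constant factor, not big-O.
fast_hours, fast_cost = training_cost(100_000, 90, 10.0)

print(base_hours, base_cost)
print(fast_hours, fast_cost)
```

A better algorithm would instead lower `steps_to_converge`, which is the only place an asymptotic win could come from.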
I'm not sure what counts as "breaking new ground" - is that 10%? 100%? 1000%? :-) The things to watch out for in benchmarks will be:
(a) Perf/$. This is actually a big deal - one of my students recently blew through $5000 of Google Cloud credits running ImageNet experiments in a week. And we didn't finish them! As this cost drops, it enables things like Neural Architecture Search, which uses tons of compute to explore architectural variants automatically.
(b) Absolute perf.
(c) Performance scaling. To what degree will the fast 2D toroidal mesh allow a full pod of Cloud TPUs to scale nearly linearly? Absolute training times matter from a user productivity standpoint. Waiting 30 minutes for a result is very different from waiting 12 hours (you can do one of these while you sneak out to go running! :-).
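For anyone who wants to compute (a) and (c) from published benchmark numbers, a minimal sketch follows. The helper names and the throughput and price figures are hypothetical placeholders, not measurements of any real TPU or GPU.

```python
def images_per_dollar(images_per_second, dollars_per_hour):
    """Perf/$: how many training images one dollar of rental time processes."""
    return images_per_second * 3600 / dollars_per_hour


def scaling_efficiency(throughput_one_device, throughput_n_devices, n_devices):
    """Fraction of ideal linear scaling achieved; 1.0 means perfectly linear."""
    return throughput_n_devices / (n_devices * throughput_one_device)


# Hypothetical numbers: one device does 1000 images/s at $9/hour,
# and a 64-device pod does 57600 images/s.
perf_dollar = images_per_dollar(1000, 9.0)
efficiency = scaling_efficiency(1000, 57600, 64)

print(perf_dollar)
print(efficiency)
```

An efficiency close to 1.0 on the full pod is what "nearly linear" would mean here; anything well below that means the interconnect or the large-batch training recipe is the bottleneck.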
The NIPS'17 slides have more technical context for some of this: https://supercomputersfordl2017.github.io/Presentations/Imag...