by dgacmu 1128 days ago
Mostly you need to be able to stash the intermediate products computed in the forward pass so that you can access them in the backward pass. That requires more memory, more memory bandwidth, and more transpose operations. Training also usually runs at somewhat higher precision (bf16 instead of int8, as one example).
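To make the "stash intermediates" point concrete, here is a minimal NumPy sketch of a two-layer MLP with a hand-written backward pass. The function names (`forward`, `backward`), the `cache` tuple, and the layer shapes are made up for illustration; this is not how any particular accelerator or framework lays things out. Note that the backward pass needs both the saved activations and transposed operands, neither of which an inference-only forward pass requires.

```python
import numpy as np

def forward(x, w1, w2):
    """Forward pass of a tiny two-layer MLP (hypothetical example network)."""
    h_pre = x @ w1              # pre-activation of the hidden layer
    h = np.maximum(h_pre, 0.0)  # ReLU
    y = h @ w2
    # Inference could throw these away; training has to keep them around.
    cache = (x, h_pre, h)
    return y, cache

def backward(dy, cache, w1, w2):
    """Backward pass: consumes the stashed intermediates and uses transposes."""
    x, h_pre, h = cache
    dw2 = h.T @ dy              # needs the stashed activation h, transposed
    dh = dy @ w2.T              # needs the weight matrix, transposed
    dh_pre = dh * (h_pre > 0)   # ReLU derivative needs the stashed pre-activation
    dw1 = x.T @ dh_pre          # needs the stashed input, transposed
    dx = dh_pre @ w1.T
    return dx, dw1, dw2

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))
w1 = rng.standard_normal((3, 5))
w2 = rng.standard_normal((5, 2))

y, cache = forward(x, w1, w2)
# Gradient of sum(y) with respect to the input and both weight matrices.
dx, dw1, dw2 = backward(np.ones_like(y), cache, w1, w2)
```

Every term in `backward` reads something from `cache`, which is where the extra memory capacity and bandwidth go.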

What about the autodiff/VJP lookup tables? What's the overhead like for those?
I think it's helpful to split what goes into an ML accelerator into two categories: big-picture architectural features (memory bandwidth and capacity, support for large operations like transposition, etc.) and fixed-function optimizations. In all of these systems, a compiler is responsible for taking higher-level operations and compiling them down to those low-level ones. That includes the derivatives used in backprop: they just get mapped to the same primitive operations plus a few more. While there are a few more fixed functions you need to add for loss functions and some derivatives, probably the largest difference is that you need to support transpose (and that you need all that extra memory and bandwidth to keep those intermediate products around in order to backprop through them).
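As a rough illustration of "derivatives get mapped to the same primitives plus a few more", here is a toy VJP table in NumPy. The `VJP_RULES` dictionary and its keys are hypothetical and only meant to show the idea; a real compiler resolves rules like these at compile time, so nothing like a table lookup needs to exist on the chip itself.

```python
import numpy as np

# Hypothetical toy VJP table: each forward primitive maps to a backward rule
# that takes the upstream gradient g plus the forward operands (a, b) and
# returns the gradients with respect to a and b.
VJP_RULES = {
    # matmul's backward is two more matmuls, plus transposes of the operands
    "matmul": lambda g, a, b: (g @ b.T, a.T @ g),
    # elementwise multiply's backward is two more elementwise multiplies
    "mul": lambda g, a, b: (g * b, g * a),
    # add just passes the gradient through to both inputs
    "add": lambda g, a, b: (g, g),
}

rng = np.random.default_rng(0)
a = rng.standard_normal((2, 3))
b = rng.standard_normal((3, 4))

out = a @ b                # forward primitive
g = np.ones_like(out)      # upstream gradient of sum(out)
da, db = VJP_RULES["matmul"](g, a, b)
```

The only operation on the right-hand side that the forward pass did not already need is the transpose, which matches the point above.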

This paper has a nice summary of the challenges of going from an inference-only TPU to the training-capable TPUv2: https://ieeexplore.ieee.org/document/9351692

Look for the section "CHALLENGES AND OPPORTUNITIES OF BUILDING ML HARDWARE"

But then things change further when you want to start supporting embeddings. Since TPUv2, Google's TPUs have included a "sparse core" to handle those separately, because the lookup and memory access patterns are drastically different from those of the typical dense matrix operations used for non-embedding layers: https://arxiv.org/pdf/2304.01433.pdf
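To show why embedding access patterns look so different from dense layers, here is a small NumPy sketch comparing a gather-style lookup with the mathematically equivalent one-hot matmul. The table size, the ids, and the variable names are made up for illustration, and this says nothing about how the sparse core is actually implemented.

```python
import numpy as np

vocab, dim = 1000, 8
rng = np.random.default_rng(0)
table = rng.standard_normal((vocab, dim))  # hypothetical embedding table
ids = np.array([3, 917, 42, 3])            # note the repeated id 3

# Sparse view: a gather that reads only len(ids) rows, at scattered addresses.
gathered = table[ids]

# Dense view: the same result as a one-hot matmul, which streams through
# every row of the table even though almost all of the products are zero.
one_hot = np.zeros((len(ids), vocab))
one_hot[np.arange(len(ids)), ids] = 1.0
dense = one_hot @ table

# Backward pass: a scatter-add that touches only the rows that were looked up.
# np.add.at accumulates correctly when the same id appears more than once.
grad_out = np.ones_like(gathered)
grad_table = np.zeros_like(table)
np.add.at(grad_table, ids, grad_out)

touched_rows = int(np.count_nonzero(grad_table.any(axis=1)))
```

Here `touched_rows` is 3 out of 1000, so both the forward gather and the backward scatter-add are dominated by irregular memory access rather than arithmetic.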