Hacker News
by dekhn 1481 days ago
TPUs make numerical errors more frequently than you'd expect, and it's not cosmic rays, it's QA errors: individual chips were manufactured that passed QA but very occasionally, for specific inputs and operations, produce garbage. When you run on a full pod, many workloads will eventually see corruption, often in the form of a NaN propagating through critical data like the gradients or weights, which the training run cannot recover from.
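
To make the propagation concrete, here's a toy sketch in plain Python (not TPU code, and not anyone's actual training loop). All the names (`sgd_step`, `guarded_sgd_step`, `forward`) are made up for illustration. It assumes the corruption shows up as a NaN in one gradient value, and shows how that one value ends up in the weights and then in every output, plus one common mitigation: skip the step when a gradient is non-finite.

```python
import math

def sgd_step(weights, grads, lr=0.1):
    """Plain SGD update with no checks: a NaN in grads lands in weights."""
    return [w - lr * g for w, g in zip(weights, grads)]

def guarded_sgd_step(weights, grads, lr=0.1):
    """Skip the update if any gradient is NaN or Inf.

    Returns (weights, applied) so the caller can count skipped steps.
    """
    if not all(math.isfinite(g) for g in grads):
        return weights, False
    return sgd_step(weights, grads, lr), True

def forward(weights, inputs):
    """Toy linear model: dot product of weights and inputs."""
    return sum(w * x for w, x in zip(weights, inputs))

weights = [0.5, -1.0, 2.0]
bad_grads = [0.1, float("nan"), -0.3]  # assumed: one corrupted gradient value

# Unguarded: the NaN is written into the weights...
corrupted = sgd_step(weights, bad_grads)
# ...and from then on every output that touches that weight is NaN too.
poisoned_output = forward(corrupted, [1.0, 1.0, 1.0])

# Guarded: the bad step is dropped and the old weights survive.
kept, applied = guarded_sgd_step(weights, bad_grads)
```

Note that a check like this only catches NaN/Inf. A chip that produces a wrong but finite number sails straight through, so this is a partial guard at best.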

In fact, a recent big paper from Google mentioned that training occasionally went wonky in completely nonreproducible ways, but I am pretty sure I know what happened.