by dekhn
1481 days ago
TPUs make numerical errors more frequently than you'd expect, and it's not cosmic rays, it's QA errors (individual chips that passed QA but very occasionally, for specific inputs and operations, produce garbage). When you run on a full pod, many workloads will eventually see corruption, often in the form of a NaN that propagates through critical data like the gradients or weights and that training cannot recover from. In fact, a recent big paper from Google mentioned that training occasionally went wonky in completely nonreproducible ways, but I am pretty sure I know what happened.
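
To illustrate why a single bad value is so damaging, here is a minimal sketch in plain Python. It assumes a toy SGD update; the function names (`sgd_step`, `guarded_sgd_step`) are hypothetical, and the NaN is injected by hand rather than produced by real hardware. It only shows the propagation and one common mitigation (skipping a step with non-finite gradients). It is not how any particular framework or the paper handled it.

```python
import math

def sgd_step(weights, grads, lr=0.1):
    """Plain SGD update: w <- w - lr * g (toy example, hypothetical helper)."""
    return [w - lr * g for w, g in zip(weights, grads)]

def all_finite(values):
    """True if no value is NaN or +/-inf."""
    return all(math.isfinite(v) for v in values)

def guarded_sgd_step(weights, grads, lr=0.1):
    """Skip the update when any gradient is non-finite.

    Returns (weights, applied); weights are unchanged if the step was skipped.
    """
    if not all_finite(grads):
        return weights, False
    return sgd_step(weights, grads, lr), True

weights = [1.0, 2.0]
bad_grads = [0.5, float("nan")]   # simulated corrupted gradient
good_grads = [0.5, 0.5]

# Unguarded: the NaN lands in the weights and never leaves,
# because NaN minus anything is still NaN.
poisoned = sgd_step(weights, bad_grads)
poisoned = sgd_step(poisoned, good_grads)

# Guarded: the corrupted step is dropped and training can continue.
kept, applied = guarded_sgd_step(weights, bad_grads)
recovered, _ = guarded_sgd_step(kept, good_grads)
```

A guard like this only catches corruption that shows up as NaN or inf. A chip that silently returns a finite but wrong value would pass straight through.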