Hacker News
by dekhn 1266 days ago
I would be interested in hearing the use cases for large-scale validation of floating-point data. I used to work with processors that occasionally corrupted operations due to hardware manufacturing defects, and these kinds of problems are exceptionally hard to debug, so I'm curious what techniques are used.

In our case, we built tooling that ran enormous numbers of semi-random programs on the accelerator and compared the output to reliable results computed offline. About 1 in 1000 chips would, reproducibly, fail certain operations. Identifying this helped solve problems many of our researchers reported on specific accelerator clusters: they would get a NaN in their gradients, which would kill training, and it was almost always explainable by a single processor (out of ~thousands) occasionally corrupting a float.
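
To give a rough idea of the shape of that kind of test, here is a minimal sketch in Python. It is not the tooling we actually used: `make_faulty_device`, `healthy_device`, and `fuzz` are hypothetical names, the "device" is simulated in software, and the fault model (one flipped mantissa bit on one operation for certain operand bit patterns) is an assumption made up for illustration. The idea is only that seeded random operations are run on the unit under test and compared bit-for-bit against a trusted reference.

```python
import random
import struct

def bits(x: float) -> int:
    """Raw 64-bit pattern of a double, so comparisons are bit-exact (NaN-safe)."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def from_bits(n: int) -> float:
    """Inverse of bits(): rebuild a double from its 64-bit pattern."""
    return struct.unpack("<d", struct.pack("<Q", n))[0]

# Trusted reference implementations (the "reliable results computed offline").
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}

def healthy_device(op: str, a: float, b: float) -> float:
    """Hypothetical stand-in for a good chip: always matches the reference."""
    return OPS[op](a, b)

def make_faulty_device(bad_op: str, bad_bit: int):
    """Hypothetical stand-in for a defective chip (assumed fault model):
    flips one mantissa bit of the result of `bad_op` whenever the bit
    pattern of the first operand is divisible by 97. The fault is
    deterministic, so failures are reproducible."""
    def run(op: str, a: float, b: float) -> float:
        result = OPS[op](a, b)
        if op == bad_op and bits(a) % 97 == 0:
            return from_bits(bits(result) ^ (1 << bad_bit))
        return result
    return run

def fuzz(device, seed: int, trials: int):
    """Run seeded semi-random operations on `device` and collect every case
    where the result differs bit-for-bit from the reference."""
    rng = random.Random(seed)
    names = sorted(OPS)
    mismatches = []
    for _ in range(trials):
        op = rng.choice(names)
        a = rng.uniform(-1e6, 1e6)
        b = rng.uniform(-1e6, 1e6)
        expected = OPS[op](a, b)
        got = device(op, a, b)
        if bits(expected) != bits(got):
            mismatches.append((op, a, b, expected, got))
    return mismatches

if __name__ == "__main__":
    bad = fuzz(make_faulty_device("mul", bad_bit=3), seed=1234, trials=20000)
    print(f"{len(bad)} mismatches; failing ops: {sorted({m[0] for m in bad})}")
```

Because the seed fixes the operation stream, a failing case can be replayed on the same chip to confirm that the fault is reproducible, and on a different chip to rule out a software bug. Comparing bit patterns rather than float values also means a corrupted NaN payload or a sign flip on zero is not missed.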