Hacker News
by ensan 1506 days ago
"The paper mentions 35 (!) manual restarts to train OPT-175B due to hardware failure (and 70+ automatic restarts)."

https://twitter.com/awnihannun/status/1521572873449533440