Hacker News
by bbminner 684 days ago
Because accelerators (TPUs, GPUs), unlike RAM/CPU, are notoriously hard to timeshare and virtualize. So if you get evicted in an environment like that, you have to reload your entire experiment state from a model checkpoint. With giant models like these, that might take dozens of minutes. As a result, I doubt that these experiments are done using "spare" resources - in that case, constant interruptions and reloading would result in these experiments finishing sometime around the heat death of the universe :)
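
To put rough numbers on that, here is a back-of-the-envelope sketch. It is a toy model, not a measurement: the function names and all inputs are made up for illustration. It assumes evictions arrive at a fixed interval and that each one costs a full checkpoint reload plus, on average, half a checkpoint interval of lost work.

```python
import math


def useful_fraction(minutes_between_evictions, reload_minutes, checkpoint_interval_minutes):
    """Fraction of wall-clock time that turns into saved training progress.

    Toy assumption: every eviction costs one reload plus, on average,
    half a checkpoint interval of work that was never saved.
    """
    lost = reload_minutes + checkpoint_interval_minutes / 2
    useful = max(0.0, minutes_between_evictions - lost)
    return useful / minutes_between_evictions


def wall_clock_hours(compute_hours, minutes_between_evictions, reload_minutes,
                     checkpoint_interval_minutes):
    """Wall-clock hours needed to finish a job that needs `compute_hours` of real work."""
    frac = useful_fraction(minutes_between_evictions, reload_minutes,
                           checkpoint_interval_minutes)
    # If every slot is eaten by reloading, the job never finishes.
    return math.inf if frac == 0 else compute_hours / frac


if __name__ == "__main__":
    # Hypothetical inputs: 100 hours of compute, 30-minute reload, 20-minute checkpoints.
    for gap in (240, 60, 30):
        print(gap, wall_clock_hours(100, gap, 30, 20))
```

Under these made-up inputs, an eviction every hour leaves only a third of wall-clock time as useful work, and once the gap between evictions shrinks to the reload time itself, the job makes no progress at all.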