by gwern 883 days ago
Heh. This problem reminds me of back in 2019, when I was working with Shawn Presser on finetuning GPT-2 using Google Colab. There was a problem where it would randomly error out every once in a while, and IIRC it would take like 10 minutes to redownload the last saved checkpoint from our server and minutes to save the current checkpoint. So the question was: how often should we save to minimize the time spent restoring+saving? I did a bit of algebra and I think we wound up with an answer like '40 minutes'!
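
A rough sketch of that kind of calculation, which is not necessarily the algebra done at the time. It assumes failures arrive at a constant average rate and uses the standard first-order (Young-style) approximation: with save cost C and mean time between failures M, the overhead per minute of useful work is roughly C/T + (T/2 + R)/M for a checkpoint interval T and restore cost R, which is minimized at T = sqrt(2*C*M). The function names are hypothetical. The 3-minute save cost and 240-minute failure interval are made-up illustrative values; only the 10-minute restore comes from the comment.

```python
import math


def overhead_fraction(interval_min, save_min, restore_min, mtbf_min):
    """Approximate wasted minutes per minute of useful work.

    First-order model (an assumption, valid when interval << mtbf):
    - each interval pays one checkpoint save,
    - each failure costs a restore plus, on average, half an interval
      of lost work.
    """
    save_overhead = save_min / interval_min
    failure_overhead = (interval_min / 2 + restore_min) / mtbf_min
    return save_overhead + failure_overhead


def optimal_interval(save_min, mtbf_min):
    """Interval minimizing overhead_fraction: sqrt(2 * save cost * MTBF).

    The restore cost shifts the total overhead but not the optimum,
    because it does not depend on the interval.
    """
    return math.sqrt(2 * save_min * mtbf_min)


if __name__ == "__main__":
    # Hypothetical inputs: 3 min to save, a failure every 4 hours on average.
    # The 10 min restore is the only figure taken from the comment.
    save_min, restore_min, mtbf_min = 3.0, 10.0, 240.0
    best = optimal_interval(save_min, mtbf_min)
    print(f"checkpoint roughly every {best:.1f} minutes")
    for t in (10.0, best, 120.0):
        frac = overhead_fraction(t, save_min, restore_min, mtbf_min)
        print(f"interval {t:6.1f} min -> overhead {frac:.3f}")
```

The trade-off is visible in the two terms: saving too often pays the save cost repeatedly, while saving too rarely loses more work per failure.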

DL infrastructure & training practices have gotten better since then...