by bradhilton 469 days ago
We used about 58 hours on 4xH100s and about 19 hours on 8xH100s to get the best result with the 32B model. We trained for about another 16 hours before ending the run, but we could have stopped earlier, once it was apparent the model was regressing. Actual dollar costs are provider-dependent.
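For a rough sense of scale, the two phases above come to 58 × 4 + 19 × 8 = 384 H100-hours to reach the best checkpoint. A minimal sketch of that arithmetic follows. The hourly rate is a hypothetical placeholder, not a quote from any provider, and the extra ~16 hours are left out because the GPU count for that stretch isn't stated.

```python
# GPU-hours to reach the best 32B checkpoint, using the figures quoted above.
# Each phase is (wall-clock hours, number of H100s).
PHASES = [(58, 4), (19, 8)]


def gpu_hours(phases):
    """Sum wall-clock hours times GPU count over all phases."""
    return sum(hours * gpus for hours, gpus in phases)


def estimated_cost(phases, usd_per_gpu_hour):
    """Rough dollar cost at a flat hourly rate per GPU."""
    return gpu_hours(phases) * usd_per_gpu_hour


# HYPOTHETICAL rate for illustration only; substitute your provider's price.
EXAMPLE_RATE_USD = 2.50

print(gpu_hours(PHASES))                         # 58*4 + 19*8 = 384
print(estimated_cost(PHASES, EXAMPLE_RATE_USD))  # 384 * 2.50 = 960.0
```

Swap in your own rate, and append another `(hours, gpus)` tuple to `PHASES` if you want to account for the tail of the run.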