Hacker News new | ask | show | jobs
by londons_explore 1262 days ago
Slower training tends to be only a little cheaper, because most modern architectures parallelize well, and they just care about the number of flops.

If you want to reduce cost, you need to reduce the model size, and you'll get worse results for less money.