And I'd be able to train a learned optimizer to replace gradient descent as the training process.
Even without either of those, performance improves in a predictable way with more compute thanks to scaling laws.
And I'd be able to train a learned optimizer to replace gradient descent as the training process.
Even without either of those, performance improves in a predictable way with more compute thanks to scaling laws.