Hacker News new | ask | show | jobs
by arugulum 1035 days ago
If you want a speedrun explanation for how we get to "2": In the limit of model scaling, context size doesn't matter (yes, forget about the quadratic attention), most of the compute is in the linear layers, which boil down to matrix multiplies. Consider a single matrix of size [T,d] multiplied by weight of size [d,d], the compute needed for a matrix multiplication is approximately 2Td^2 (2 coming from multiply + add). Swap T out with D for your whole dataset in tokens, d^2 is the number of parameters in a single linear layer so scale up your model to P, and you've got 2PD.

Even shorter: The 2 comes from the multiply-add