by rolisz
1156 days ago
Hardware isn't scaling exponentially anymore (Moore's law is dead). Parameter count isn't really scaling exponentially anymore either. GPT-3 had 175B parameters 3 years ago. There have been some attempts at training 1-trillion-parameter models, but they are not better than GPT-3.

We're also seeing lots of optimizations in new models (RoPE/RoPER embedding, Swish/GeLU activation, Flash Attention, etc.), but I think some of the most interesting gains we'll be seeing soon will come from inference-optimized training (-70% parameters for +100% training compute) [1] combined with sparsity pruning (-50% size with almost no loss in accuracy) [2] and quantization [3], which together should lead to significantly smaller models that still perform well.
[1] https://www.harmdevries.com/post/model-size-vs-compute-overh...
[2] https://arxiv.org/abs/2301.00774
[3] https://openreview.net/forum?id=tcbBPnfwxS
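To get a feel for how those three savings stack, here is a rough back-of-the-envelope sketch. It is only an illustration under assumptions of mine: it applies the -70% parameter figure to a GPT-3-sized 175B model (the figure in [1] is stated relative to a compute-optimal model, not GPT-3), it assumes 16-bit weights as the starting point and 4-bit weights after quantization (the comment above gives no bit width), and it ignores the index overhead that sparse storage needs in practice. The function name `weight_footprint_gb` is made up for this example.

```python
def weight_footprint_gb(params_billion, bits_per_weight, kept_fraction=1.0):
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    kept_fraction is the share of weights left after pruning.
    Sparse-index overhead and activations are ignored (assumption).
    """
    total_bits = params_billion * 1e9 * kept_fraction * bits_per_weight
    return total_bits / 8 / 1e9


# Assumed starting point: a 175B-parameter model stored in 16-bit weights.
baseline_params = 175.0
baseline_gb = weight_footprint_gb(baseline_params, bits_per_weight=16)

# Step 1: inference-optimized training, -70% parameters (for +100% compute).
small_params = baseline_params * (1 - 0.70)
small_gb = weight_footprint_gb(small_params, bits_per_weight=16)

# Step 2: sparsity pruning, -50% size.
pruned_gb = weight_footprint_gb(small_params, bits_per_weight=16, kept_fraction=0.5)

# Step 3: quantization to 4 bits per weight (assumed bit width).
quantized_gb = weight_footprint_gb(small_params, bits_per_weight=4, kept_fraction=0.5)

for label, gb in [
    ("baseline", baseline_gb),
    ("smaller model", small_gb),
    ("+ pruning", pruned_gb),
    ("+ quantization", quantized_gb),
]:
    print(f"{label:>15}: {gb:7.1f} GB ({gb / baseline_gb:.1%} of baseline)")
```

Under these assumptions the three steps multiply (0.3 x 0.5 x 0.25), so the weights end up at a few percent of the baseline footprint. Whether accuracy actually holds up when all three are combined is a separate question that this arithmetic does not answer.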