|
|
|
|
|
by ein0p
808 days ago
|
|
This only cuts compute by “up to” 50% and only during inference. Quadratic dependence on context size remains, as do the enormous memory requirements. For something to be considered a bulls eye in this space it has to offer nonlinear improvements on both of these axes, and/or be much faster to train. Until that happens, people, including Google will continue to train bog standard MoE and dense transformers. Radical experimentation at scale is too expensive even for megacorps at this point. |
|