by achrono
5 days ago
How do we know that today's frontier models are merely scaled-up versions of that? Genuine question, since over the years the labs have narrowed what they share about how the models are trained and how they work under the hood to almost nothing.
|
There have been minor changes to the architecture over the years, but these are basically all efficiency tweaks: various types of attention (some pioneered in the open by DeepSeek) that scale better to long context lengths, and the confusingly named "mixture of experts" architecture. What's more notable, really, is how little the architecture has changed. The capability gains have been coming from better training and better data.
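
To make the "mixture of experts" point concrete, here is a toy sketch of top-k routing in plain Python. It is only an illustration under my own assumptions, not any lab's actual implementation: the names `moe_layer`, `router_weights`, and `top_k` are made up for this example, and the "experts" are trivial scaling functions standing in for feed-forward blocks. The idea it shows is that a small router scores the experts for each token and only the best few are actually run, which is why this is an efficiency tweak rather than a new kind of model.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer(x, router_weights, experts, top_k=2):
    """Route one token vector x through its top_k highest-scoring experts.

    router_weights has one row per expert (hypothetical toy router);
    each expert is a callable mapping a vector to a vector of the same size.
    """
    # Router score for each expert: dot product of its row with the token.
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    # Keep only the top_k experts; the others are never evaluated.
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Normalise the chosen scores into mixing weights.
    gates = softmax([scores[i] for i in chosen])
    out = [0.0] * len(x)
    for gate, i in zip(gates, chosen):
        y = experts[i](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, chosen


# Four toy experts that just scale their input by a fixed factor.
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

out, chosen = moe_layer([2.0, 1.0], router_weights, experts, top_k=2)
print(chosen, out)
```

With four experts and `top_k=2`, only half of the expert computation runs for this token, while the total parameter count still covers all four. Real systems do this per token inside each transformer layer, with learned routers and much larger experts.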