by vlovich123
103 days ago
The hardware difference explains runtime performance differences, not task performance. Speculation is that the frontier models are all below 200B parameters, but a 2x size difference wouldn’t fully explain task performance differences.

Some versions of some of the models are around that size, which you might hit, for example, with the ChatGPT auto-router.
But the frontier models are all over 1T parameters. Source: watch interviews with people who have left one of the big three labs and now work at the Chinese labs, where they talk about how to train 1T+ models.