by goldenarm
5 days ago
Consumer and server hardware are quite different, especially Google's TPUs. They notably have much larger mixture-of-experts ratios and more complex caching systems. At that scale, and with those inference budgets, they are incentivised to optimize as much as possible. Also, Google DeepMind has a six-month embargo on strategic papers, so I bet the juiciest quantization tech isn't public yet.