|
|
|
|
|
by nijave
3 hours ago
|
|
Which model? I see a suspiciously similar post on amd.com running 2 bit Kimi quant on a four node cluster over 5Gbps Ethernet Assuming math works here although I think there's some caveats depending on the model architecture, 1T 4 bit is 465Gi just for the weights so you wouldn't be able to fit kv cache. It's showing about 8-9 tk/sec which seems quite slow for something like a web search with result aggregate although maybe bareable for smaller context stuff The thing I've been running into with z.ai hosted GLM-5.2 is the 2024 knowledge cutoff. Anything recent requires web augmentation which is more token intensive so low tk/sec hurts even more than a "smarter" model It seems (somewhat unsurprisingly) open weight models have older knowledge cutoffs. |
|
I’m not trying to say that this would be a great experience or really compete with just buying a subscription to the top models. Rather I just wanted to point out that $300k is an absurd estimate for a trillion param model meant for personal use.