by easygenes
2 days ago
The article reads as though it was written by someone without much experience with deployments like this. It underestimates the memory needed to run with a reasonable amount of context, and it misses two other obvious targets:
1) 4x DGX Spark (or equivalent GB10 boxes) with a switch (MikroTik CRS504 or CRS804) and TP=4.
2) A 4x RTX PRO 6000 box. Probably the most practical for cost/perf if you want on-prem as an individual.
Both would be best run with a 2-bit quant so everything can stay resident. The article claims you could run a 4-bit quant with 4x RTX 6000 Ada, and while that is technically true, it would mean a lot of the weights are streaming from DRAM, so it would be slow and impractical. You would need 8x RTX PRO 6000 to run 4-bit at a good speed.
This model quantizes unusually well: https://unsloth.ai/docs/models/glm-5.2#quantization-analysis
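For anyone who wants to sanity-check the "stays resident" argument, here is a rough back-of-envelope sketch. The parameter count below is a placeholder assumption (I have not confirmed the real figure, so substitute it), nominal bit widths ignore the per-block overhead real quant formats add, and the 20% reserve for KV cache and activations is a guess, not a measurement. The per-card VRAM numbers (96 GB for RTX PRO 6000, 48 GB for RTX 6000 Ada) are the published specs.

```python
# Back-of-envelope check: do quantized weights fit in aggregate VRAM?
# ASSUMPTION: N_PARAMS is a hypothetical placeholder, not the real model size.
N_PARAMS = 750e9


def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB at a nominal bit width."""
    return n_params * bits_per_weight / 8 / 1e9


def fits_resident(n_params: float, bits_per_weight: float,
                  n_devices: int, gb_per_device: float,
                  reserve_fraction: float = 0.2) -> bool:
    """True if weights fit after reserving a fraction of memory
    for KV cache and activations (the reserve is an assumed value)."""
    budget_gb = n_devices * gb_per_device * (1.0 - reserve_fraction)
    return weights_gb(n_params, bits_per_weight) <= budget_gb


if __name__ == "__main__":
    configs = {
        "4x RTX 6000 Ada (48 GB)": (4, 48),
        "4x RTX PRO 6000 (96 GB)": (4, 96),
        "8x RTX PRO 6000 (96 GB)": (8, 96),
    }
    for name, (n, gb) in configs.items():
        for bits in (2, 4):
            ok = fits_resident(N_PARAMS, bits, n, gb)
            print(f"{name}, {bits}-bit: "
                  f"{weights_gb(N_PARAMS, bits):.1f} GB weights, "
                  f"resident={ok}")
```

Anything that comes back `resident=False` means part of the weights would have to stream from system DRAM, which is the slow case described above. Longer context means a larger reserve, so raise `reserve_fraction` accordingly.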