by easygenes 2 days ago
Article reads as though written by someone who doesn't have much experience with deployments like this. Underestimates the memory needed to run with a reasonable amount of context. Misses two other obvious targets:

  1) 4x DGX Spark (or equivalent other GB10 boxes) with a switch (MikroTik CRS504 or CRS804) and TP=4.
  2) 4x RTX PRO 6000 box. Probably the most practical for cost/perf if you want on-prem as an individual.

Both would be best run with a 2-bit quant so everything can stay resident. The article claims you could run a 4-bit quant with 4x RTX 6000 Ada. That is technically true, but a lot of the weights would be streaming from DRAM, so it would be slow and impractical. You would need 8x RTX PRO 6000 to run 4-bit at a good speed.
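
For anyone who wants to redo the arithmetic, here is a rough sketch of the fit check in Python. The parameter count (`ASSUMED_PARAMS`) and the 25% reserve for KV cache and activations are placeholders I made up for illustration, not numbers from the article or from this thread, and the helper names are hypothetical. Real quant formats also spend more than their nominal bits per weight, so treat the result as optimistic.

```python
# Back-of-the-envelope check: do quantized weights fit in aggregate device memory?
# Model-side numbers are placeholders, not figures from the article or the thread.

ASSUMED_PARAMS = 750e9  # hypothetical total parameter count; substitute the real one

# (device count, memory per device in GB), using vendor-published capacities.
CONFIGS = {
    "4x DGX Spark": (4, 128),      # unified memory, shared with the OS
    "4x RTX PRO 6000": (4, 96),
    "8x RTX PRO 6000": (8, 96),
    "4x RTX 6000 Ada": (4, 48),
}


def weights_gb(params, bits_per_weight):
    """Size of the weights in GB at a given average bits per weight."""
    return params * bits_per_weight / 8 / 1e9


def fits(config, bits_per_weight, params=ASSUMED_PARAMS, reserve=0.25):
    """True if the weights fit after reserving a fraction of memory.

    `reserve` is an assumed allowance for KV cache, activations, and
    runtime overhead; it grows with context length.
    """
    count, gb_each = CONFIGS[config]
    usable_gb = count * gb_each * (1 - reserve)
    return weights_gb(params, bits_per_weight) <= usable_gb


if __name__ == "__main__":
    for name in CONFIGS:
        for bits in (2, 4):
            verdict = "fits" if fits(name, bits) else "spills to host memory"
            print(f"{name}, {bits}-bit: {verdict}")
```

Raising `reserve` is the quickest way to see how a longer context eats into the room left for weights.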

This model quantizes unusually well: https://unsloth.ai/docs/models/glm-5.2#quantization-analysis

1 comment

Can you really say you're running GLM 5.2 if it's a 2-bit quant? It might be usable, but the capabilities will definitely not be the same.