| HN Mirror



	by reilly3000 130 days ago
	Which takes a $20k Thunderbolt cluster of two 512GB RAM Mac Studio Ultras to run at full quality…

5 comments

0xbadcafebee 130 days ago

Most benchmarks show very little improvement of "full quality" over a quantized lower-bit model. You can shrink the model to a fraction of its "full" size and get 92-95% same performance, with less VRAM use.


MuffinFlavored 130 days ago

> You can shrink the model to a fraction of its "full" size and get 92-95% same performance, with less VRAM use.

Are there a lot of options for "how far" you quantize? How much VRAM does it take to get the 92-95% you are speaking of?


bigyabai 130 days ago

> Are there a lot of options for "how far" you quantize?

So many: https://www.reddit.com/r/LocalLLaMA/comments/1ba55rj/overvie...

> How much VRAM does it take to get the 92-95% you are speaking of?

For inference, it's heavily dependent on the size of the weights (plus context). Quantizing an f32 or f16 model to q4/mxfp4 won't necessarily use 92-95% less VRAM, but it's pretty close for smaller contexts.


MuffinFlavored 130 days ago

Thank you. Could you give a tl;dr on "the full model needs ____ this much VRAM and if you do _____ the most common quantization method it will run in ____ this much VRAM" rough estimate please?


omneity 130 days ago

It’s a trivial calculation to make (+/- 10%).

Number of params == “variables” in memory

VRAM footprint ~= number of params * size of a param

A 4B model at 8 bits will take about 4GB of VRAM, give or take: one byte per param, so the GB figure matches the param count in billions. At 4 bits it's ~2GB, and so on. Kimi is about 512GB at 4 bits.
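As a rough sketch of that arithmetic in Python (weights only; it ignores context/KV cache and runtime overhead, which is where the +/- 10% comes from). The function name is made up for illustration, and the 1e12 parameter count for Kimi is an assumption inferred from the "about 512GB at 4 bits" figure, not something stated in this thread:

```python
def weight_footprint_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GB (1 GB = 1e9 bytes).

    Hypothetical helper: params * size of a param, nothing more.
    """
    total_bytes = num_params * bits_per_param / 8  # 8 bits per byte
    return total_bytes / 1e9


if __name__ == "__main__":
    print(weight_footprint_gb(4e9, 8))    # 4B model at 8 bits
    print(weight_footprint_gb(4e9, 4))    # same model at 4 bits
    print(weight_footprint_gb(4e9, 16))   # same model at f16
    print(weight_footprint_gb(1e12, 4))   # assumed ~1T-param model at 4 bits
```

Going from f16 to 4 bits cuts the weight footprint to a quarter by this estimate; whatever the context needs comes on top of that.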


polynomial 130 days ago

Depending on what your usage requirements are, Mac Minis running UMA over RDMA are becoming a feasible option. At roughly 1/10 of the cost you're getting much, much more than 1/10 of the performance. (YMMV)

https://buildai.substack.com/i/181542049/the-mac-mini-moment


danw1979 130 days ago

I did not expect this to be a limiting factor in the Mac Mini RDMA setup!

> Thermal throttling: Thunderbolt 5 cables get hot under sustained 15GB/s load. After 10 minutes, bandwidth drops to 12GB/s. After 20 minutes, 10GB/s. Your 5.36 tokens/sec becomes 4.1 tokens/sec. Active cooling on cables helps but you’re fighting physics.

Thermal throttling of network cables is a new thing to me…


cat_plus_plus 130 days ago

I admire the patience of anyone who runs dense models on unified memory. Personally, I would rather feed an entire programming book or code directory to a sparse model, get an answer in 30 seconds, and then use the cloud in the rare cases where that's not enough.


polynomial 130 days ago

Luckily we're having a record cold winter and your setup can double as a personal space heater.


deaux 130 days ago

And that's at unusable speeds - it takes about triple that amount to run it decently fast at int4.

Now as the other replies say, you should very likely run a quantized version anyway.


bigyabai 130 days ago

"Full quality" being a relative assessment here. You're still deeply compute constrained; that machine would crawl at longer contexts.


teaearlgraycold 130 days ago

Which, while expensive, is dirt cheap compared to a comparable Nvidia or AMD system.


SchemaLoad 130 days ago

It's still very expensive compared to using the hosted models which are currently massively subsidised. Have to wonder what the fair market price for these hosted models will be after the free money dries up.


whatsupdog 130 days ago

I wonder if the "distributed AI computing" touted by some of the new crypto projects [0] works and is relatively cheaper.

0. https://www.daifi.ai/


cactusplant7374 130 days ago

Inference is profitable. Maybe we hit a limit and we don't need as many expensive training runs in the future.


paxys 130 days ago

Inference APIs are probably profitable, but I doubt the $20-$100 monthly plans are.


cactusplant7374 129 days ago

I wouldn’t be so sure. Most users aren’t going to use up their quota every week.


teaearlgraycold 130 days ago

For sure Claude Code isn’t profitable


bdangubic 130 days ago

Neither was Uber and … and …


plagiarist 130 days ago

Businesses will desire me for my insomnia once Anthropic starts charging congestion pricing.


blharr 130 days ago

What speed are you getting at that level of hardware though?
