HN Mirror



	by reissbaker 316 days ago
	40GB is small IMO: you can run it on a mid-tier MacBook Pro... or the smallest M3 Ultra Mac Studio! You don't need Nvidia if you're doing at-home inference; Nvidia only becomes economical at very high throughput, i.e. for dedicated inference companies. Apple Silicon is much more cost-effective for single-user inference on small-to-medium-sized models. The M3 Ultra is roughly on par with a 4090 in terms of memory bandwidth, so it won't be much slower, although it won't match a 5090. Also, for a 20B model you only really need 20GB of VRAM: FP8 is near-identical to FP16 in quality, and it's only below FP8 that you start to see dramatic drop-offs. So literally any Mac Studio available for purchase will do, and even a fairly low-end MacBook Pro would work. A 5090 should be able to handle it with room to spare as well.
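
	As a rough sanity check on the sizing above, weight memory is approximately parameter count times bytes per parameter. The sketch below is a back-of-the-envelope estimate only: it ignores activations, any KV cache, and runtime overhead, and the function name and precision table are illustrative, not taken from any library.

```python
# Approximate bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes).

    Billions of parameters times bytes per parameter gives GB directly.
    Weights only: activations and runtime overhead are not counted.
    """
    return params_billions * BYTES_PER_PARAM[precision]

if __name__ == "__main__":
    for precision in ("fp16", "fp8"):
        print(f"20B @ {precision}: {weight_memory_gb(20, precision):.0f} GB")
```

	For a 20B model this gives 40 GB at FP16 and 20 GB at FP8, which matches the figures in the comment.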

4 comments

dur-randir 316 days ago

Memory bandwidth is only relevant for comparing LLM performance. For image generation, the limiting factor is compute, and Apple sucks with it.


BoredPositron 316 days ago

If you want to wait 20 minutes for one image, you can certainly run it on a MacBook Pro.


roenxi 316 days ago

The quality doesn't have to get much higher for that to be a great deal. For humans the wait time is typically measured in days.


BoredPositron 315 days ago

Tell me you have no experience with generative AI image models nor with human artists.


roenxi 315 days ago

What experience do you want to point to? I've never seen an artist streaming where they can draw something equivalent to a good piece of AI artwork in 20 minutes. Their advantage right now comes from a higher overall cap on the quality of the work. Minute for minute, AIs are much better. It is just that it is pointless giving a typical AI more than a little time on a GPU, because current models can't consistently improve their own work.


jacquesm 315 days ago

"a good piece of AI artwork"

You really don't understand art. At all.


roenxi 314 days ago

If you need a hug, I suspect unfortunately I am on the wrong continent. Try thinking some positive thoughts.


RossBencina 316 days ago

Does M3 Ultra or later have hardware FP8 support on the CPU cores?


reissbaker 316 days ago

Ah, you're right: it doesn't have dedicated FP8 cores, so you'd get significantly worse performance (a quick Google search implies 5x worse). Although you could still run the model, just slowly.

Any M3 Ultra Mac Studio, or midrange-or-better MacBook Pro, would handle FP16 with no issues though. A 5090 would handle FP8 like a champ, and a 4090 could probably squeeze it in as well, although it'd be tight.


slickytail 316 days ago

All of this only really applies to LLMs though. LLMs are memory-bound (due to higher param counts, KV caching, and causal attention), whereas diffusion models are compute-bound (because of full self-attention that can't be cached). So even if the memory bandwidth of an M3 Ultra is close to that of an Nvidia card, generation will be much faster on a dedicated GPU.
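
The memory-bound vs compute-bound distinction can be sketched with a simple roofline-style comparison: estimate how long the arithmetic would take at peak compute, how long the memory traffic would take at peak bandwidth, and see which is larger. Everything numeric here is an assumption for illustration: the device figures are placeholders rather than specs of any real chip, FLOPs are approximated as 2 per parameter per token, attention cost is ignored, and the 4096-token diffusion step is hypothetical.

```python
def bound_by(flops: float, bytes_moved: float,
             peak_flops: float, peak_bandwidth: float) -> str:
    """Return which resource limits a step: 'compute' or 'memory'."""
    compute_time = flops / peak_flops           # seconds if only math mattered
    memory_time = bytes_moved / peak_bandwidth  # seconds if only traffic mattered
    return "compute" if compute_time > memory_time else "memory"

# Hypothetical device: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
PEAK_FLOPS = 100e12
PEAK_BANDWIDTH = 1e12

PARAMS = 20e9               # 20B-parameter model
WEIGHT_BYTES = PARAMS * 2   # FP16 weights, read once per forward pass

# LLM decode: one new token per pass, so all weights are read for ~2*PARAMS FLOPs.
llm_decode = bound_by(2 * PARAMS, WEIGHT_BYTES, PEAK_FLOPS, PEAK_BANDWIDTH)

# Diffusion step: the same weights are read once, but applied to every
# image token in parallel (4096 tokens assumed here).
diffusion_step = bound_by(2 * PARAMS * 4096, WEIGHT_BYTES, PEAK_FLOPS, PEAK_BANDWIDTH)

if __name__ == "__main__":
    print("LLM decode is bound by:", llm_decode)
    print("Diffusion step is bound by:", diffusion_step)
```

Under these assumptions the decode step is limited by memory traffic and the diffusion step by arithmetic, which is why similar memory bandwidth does not imply similar image-generation speed.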
