| HN Mirror



	by tetraodonpuffer 359 days ago
	I think the fact that, as far as I understand, it takes 40GB of VRAM to run is probably dampening some of the enthusiasm. As an aside, I am not sure why the technology to split a model across multiple cards is quite mature for LLMs, while for image models, despite also using GGUFs, this has not been the case. Maybe as image models become bigger there will be more of a push to implement it.

3 comments

reissbaker 359 days ago

40GB is small IMO: you can run it on a mid-tier MacBook Pro... or the smallest M3 Ultra Mac Studio! You don't need Nvidia if you're doing at-home inference. Nvidia only becomes economical at very high throughput, i.e. for dedicated inference companies. Apple Silicon is much more cost effective for a single user running small-to-medium-sized models. The M3 Ultra is roughly on par with a 4090 in terms of memory bandwidth, so it won't be much slower, although it won't match a 5090.

Also for a 20B model, you only really need 20GB of VRAM: FP8 is near-identical to FP16, it's only below FP8 that you start to see dramatic drop-offs in quality. So literally any Mac Studio available for purchase will do, and even a fairly low-end Macbook Pro would work as well. And a 5090 should be able to handle it with room to spare as well.
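The arithmetic behind the 40GB and 20GB figures can be sketched as below. This is a rough estimate that counts weights only, assuming 1 GB = 10^9 bytes; the function name `weight_memory_gb` is made up for illustration, and real usage is higher once activations, the text encoder, and the VAE are loaded.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes).

    Ignores activations, text encoder, VAE, and framework overhead,
    so treat the result as a lower bound rather than a requirement.
    """
    bytes_per_param = bits_per_param / 8
    # (params_billion * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billion * bytes_per_param


if __name__ == "__main__":
    # A 20B-parameter model at common precisions.
    for name, bits in [("FP16", 16), ("FP8", 8), ("4-bit", 4)]:
        print(f"{name}: ~{weight_memory_gb(20, bits):.0f} GB of weights")
```

For a 20B model this gives about 40 GB at FP16, 20 GB at FP8, and 10 GB at 4-bit, which is why precision decides which cards are in play.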

dur-randir 358 days ago

Memory bandwidth is only relevant for comparing LLM performance. For image generation, the limiting factor is compute, and Apple sucks at it.

BoredPositron 358 days ago

If you want to wait 20 minutes for one image, you can certainly run it on a MacBook Pro.

roenxi 358 days ago

The quality doesn't have to get much higher for that to be a great deal. For humans the wait time is typically measured in days.

BoredPositron 358 days ago

Tell me you have no experience with generative ai image models nor with human artists.

roenxi 358 days ago

What experience do you want to point to? I've never seen an artist streaming where they can draw something equivalent to a good piece of AI artwork in 20 minutes. Their advantage right now comes from a higher overall cap on the quality of the work. Minute for minute, AIs are much better. It is just that it is pointless giving a typical AI more than a little time on a GPU, because current models can't consistently improve their own work.

jacquesm 357 days ago

"a good piece of AI artwork"

You really don't understand art. At all.

RossBencina 359 days ago

Does M3 Ultra or later have hardware FP8 support on the CPU cores?

reissbaker 359 days ago

Ah, you're right: it doesn't have dedicated FP8 cores, so you'd get significantly worse performance (a quick Google search implies 5x worse). Although you could still run the model, just slowly.

Any M3 Ultra Mac Studio, or midrange-or-better Macbook Pro, would handle FP16 with no issues though. A 5090 would handle FP8 like a champ and a 4090 could probably squeeze it in as well, although it'd be tight.

slickytail 359 days ago

All of this only really applies to LLMs though. LLMs are memory bound (due to higher param counts, KV caching, and causal attention) whereas diffusion models are compute bound (because of full self attention that can't be cached). So even if the memory bandwidth of an M3 ultra is close to an Nvidia card, the generation will be much faster on a dedicated GPU.
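A back-of-the-envelope roofline check makes the memory-bound vs compute-bound distinction concrete. This is only a sketch under stated assumptions: about 2 FLOPs per parameter per token, weights read once per forward pass, attention FLOPs and KV-cache reads ignored. The hardware numbers (100 TFLOP/s, 1 TB/s) and the 4096-token latent are hypothetical round values, not specs of any real card or model.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Assumes ~2 FLOPs per parameter per token and that weights are read
    once per pass, so the parameter count cancels out.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def bottleneck(intensity: float, peak_flops: float, bandwidth_bytes: float) -> str:
    """Classify a workload against the hardware's roofline ridge point."""
    ridge = peak_flops / bandwidth_bytes  # FLOPs per byte needed to saturate compute
    return "memory" if intensity < ridge else "compute"


if __name__ == "__main__":
    # Hypothetical accelerator: 100 TFLOP/s peak, 1 TB/s bandwidth (ridge = 100).
    PEAK, BW = 100e12, 1e12

    # LLM decoding at batch size 1: one new token per pass, FP16 weights.
    llm = arithmetic_intensity(tokens_per_pass=1, bytes_per_param=2)

    # Diffusion step: all latent tokens processed together (4096 is illustrative).
    diffusion = arithmetic_intensity(tokens_per_pass=4096, bytes_per_param=2)

    print("LLM decode:", llm, "FLOPs/byte ->", bottleneck(llm, PEAK, BW), "bound")
    print("Diffusion step:", diffusion, "FLOPs/byte ->", bottleneck(diffusion, PEAK, BW), "bound")
```

Under these assumptions single-token decoding sits far below the ridge point and a full-image denoising step sits far above it, so matching memory bandwidth alone does not predict image generation speed.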

cma 359 days ago

If it's 40GB, you can lightly quantize it and fit it on a 5090.

Auracle 358 days ago

Which very few people have, comparatively.

Training it will also be out of reach for most. I’m sure I’ll be able to handle it on my own 5090 at some point but it’ll be slow going.

TacticalCoder 359 days ago

> I think the fact that, as far as I understand, it takes 40GB of VRAM to run, is probably dampening some of the enthusiasm.

40 GB of VRAM? So two GPUs with 24 GB each? That's pretty reasonable compared to the kind of machine needed to run the latest Qwen coder models (which btw are close to SOTA: they also beat proprietary models on several benchmarks).

cellis 359 days ago

A 3090 + 2x Titan Xp? Technically I have 48GB, but I don't think you can "split it" over multiple cards. At least with Flux, it would OOM the Titans and allocate the full 3090.

Auracle 358 days ago

You can’t split image models over 2 GPUs like you can LLMs.

BoredPositron 358 days ago

They also released an inference server for their models. Wan and qwen-image can be split without problems. https://github.com/modelscope/DiffSynth-Engine

Auracle 357 days ago

Unless I missed something while skimming their tutorial, it looks like they can use parallelism to speed things up with some models, not actually split the model (apart from the usual chunk offloading techniques).