| HN Mirror


by mft_ 18 days ago

Genuine question: is this solving a real problem?

IME, the bottleneck when using diffusion models isn't storage space or memory, it's generation time. Lots of models will run on 8-12 GB 1080-generation GPUs onwards, or on Macs with similar memory, which are probably the bottom end from a GPU power perspective anyway. I also note that these models are marginally slower than the small FLUX.2 model they're based on.

Okay, maybe this allows running a local model on something that has a reasonably powerful GPU and limited memory, like an iPhone, but is that really a common requirement?

13 comments

soerxpso 18 days ago

It's useful progress. Decent-fidelity local-scale inference means that you can create a product that generates throwaway images frequently without worrying about cost. Thus far every product I've seen that generates images is metered, which severely limits the value. I don't know if this is actually at the "decent fidelity" point yet.

moralestapia 18 days ago

Genuine question: doesn't it blow your mind that there exists a 1 Gigabyte file/program that can generate any image you can think of just from a rough description of it?

woadwarrior01 18 days ago

Where are you getting the 1 Gigabyte number from?

Their 1-bit quantized Diffusion Transformer is just under 1 GB. You also need the text-encoder (4-bit quantized) and VAE (unquantized) for inference and their combined weight is ~3.42 GB.

TBF, even at that size it's no less mind blowing.
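For a rough sense of where these sizes come from, here is a back-of-the-envelope sketch. It assumes a 4-billion-parameter transformer (going by the "4B" in the base model's name) and counts only the weights themselves; real files also carry scale factors and any layers left unquantized, so they come out somewhat larger. The function name is made up for illustration.

```python
def weight_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# Assumption: 4e9 parameters, taken from the "4B" label, not from the paper.
PARAMS = 4e9

for bits in (16, 4, 1.58, 1):
    print(f"{bits:>5} bits/weight -> {weight_size_gb(PARAMS, bits):.2f} GB")
```

Under that assumption, 16-bit weights need about 8 GB and 1-bit weights about 0.5 GB, which fits with a quantized transformer landing "just under 1 GB" once the overhead is included.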

SamBam 18 days ago

Same order of magnitude.

mft_ 18 days ago

Yeah, it's pretty incredible. And I guess that's mostly what's behind the question: whether this is more of an impressive research/technique demonstrator, or a real product advancement solving a need.

hk__2 18 days ago

> doesn't it blow your mind that there exists a 1 Gigabyte file/program that can generate any image you can think of just from a rough description of it?

I can make this into a 5-line Python program. I’m not saying the images will match the description, but that isn’t part of your spec ;)

joblessjunkie 18 days ago

We are in an era of extreme demand for GPU and limited supply. Every inference we push to the edge frees cloud resources for other tasks. Every efficiency gain increases what we can achieve with existing resources. If images can be rendered with half as much compute, we need half as many GPUs.

cheesecakegood 18 days ago

… or generate twice as many images. Maybe not quite, but if we’ve seen anything with AI so far, it’s that it fits Parkinson’s law pretty well.

SwellJoe 18 days ago

I think the value of it is currently more academic than useful in the real world. Everything at the frontier is still only marginally Good Enough (in image generation, most of it is shit even from the best models), so things far behind the frontier in terms of capability (as a tiny 1-bit model necessarily must be) are unusable.

But, getting remarkably higher density of capability per unit of compute is a big thing. It means the frontier can get better and cheaper to operate and less resource hungry, and it means what can be accomplished at the edge, on personal laptops or phones, becomes a broader spectrum of tasks.

And, for privacy, there are a lot of things that should run on-device and not everyone has big dedicated GPUs.

goofy_lemur 18 days ago

Yes, it's a huge deal because these are starting to get bound by memory bandwidth, not compute. Therefore one-bit weights stream way faster, leading to substantially better results. At least that's what I'd guess!

liuliu 18 days ago

It solves part of the download issue if they actually deliver a whole 1-bit package (currently their download is around 3.5 GiB, still not ideal since you can get a FLUX.2 [klein] 4B package including the text encoder at ~6 GiB).

For speed, no. Draw Things runs on iPhone just fine and generally faster than their implementation on the same model (FLUX.2 [klein] 4B).

kiicia 18 days ago

It’s like asking how Memoji generation on iPhone solved a real problem.

It does not need to directly solve any particular problem to be good for consumers overall: it puts pressure on all those subscription-based solutions… at least it’s private and does not require you to provide all your data…

fulafel 18 days ago

> Lots of models will run on 8-12 GB 1080-generation GPUs onwards, or on Macs with similar memory, which are probably the bottom end from a GPU power perspective anyway.

Not the bottom end - most people are on laptops or mobile devices that are much lower GPU power than this.

mft_ 18 days ago

Probably the bottom end an individual would want to consider using due to slow generation time.

Sure, you could theoretically take a model compressed in this manner and deploy it on an old netbook and run the calculations on the CPU, but each image would probably take an hour…

jeroenhd 18 days ago

My laptop has a Pascal-era Nvidia GPU with 4GiB of VRAM. It's not very efficient, but it can do these tasks a whole lot faster than the CPU. The 4GiB of VRAM, though, pretty much limits its use to only the tiniest models.

If this model can run inside of the 4GiB limit, that makes this infinitely more useful than existing models for me.

fulafel 18 days ago

I was thinking more about the 0-3 year old midrange x86 laptops and phones, they have unified memory GPUs that are easily worth using (vs CPU), support narrow FP datatypes but don't have a ton of memory bandwidth.

mft_ 18 days ago

Fair enough :)

wmertens 17 days ago

What do you mean? In the whitepaper they say that the original can't run on an iPhone 17 at all, and on an M4 the Bonsai version runs 5.6x faster than the original.

This quantization gives a small order-of-magnitude improvement in memory and compute requirements, so how can it be slower?

And all that while retaining quality.

ppeetteerr 18 days ago

Yes, size and performance are not only problems for local LLMs, they are problems for frontier LLM companies like OpenAI and Anthropic. The latter still lose a ton of money on inference, and advances in efficient, performant models help their bottom line.

wmf 18 days ago

For free users, I guess local generation is going to be faster than waiting in a queue.

MrDrMcCoy 18 days ago

Lower memory use == higher speed. Memory bandwidth is conserved with less to transfer; this is the biggest bottleneck. Compressing your filesystem generally makes storage faster as well.
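A minimal sketch of that argument, for the case where a forward pass really is limited by streaming the weights from memory. The parameter count (4e9) and the bandwidth (100 GB/s) are assumed figures for illustration, not measurements of any device or of this model, and the function name is hypothetical.

```python
def min_weight_stream_ms(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on the time to read every weight once, in milliseconds."""
    return weight_bytes / bandwidth_bytes_per_s * 1e3

# Assumptions for illustration only.
PARAMS = 4e9          # parameter count
BANDWIDTH = 100e9     # memory bandwidth in bytes per second

for label, bits in (("FP16", 16), ("4-bit", 4), ("1-bit", 1)):
    weight_bytes = PARAMS * bits / 8
    ms = min_weight_stream_ms(weight_bytes, BANDWIDTH)
    print(f"{label:>5}: at least {ms:.0f} ms per pass over the weights")
```

This only bounds the memory-traffic part. If a step does a lot of arithmetic per weight loaded, as diffusion steps over many image tokens can, compute may dominate instead, and smaller weights would then help less than the ratio suggests.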

c0rruptbytes 18 days ago

ideally if ternary models work, the math is extremely easy for computers (addition/subtraction vs 16 bit multiplication)

jjcm 18 days ago

Not quite as I understand it. The ternary approach bonsai uses leverages a FP16 scaling factor that each value in the ternary maps to. You're still using 16 bit multiplication, it's just that the weights are far more compressed.

c0rruptbytes 18 days ago

fair, i think i was referring more to the 1.58-bit architecture in general, since the original paper (Figure 3) shows that we eliminate FP16 multiplication and addition in favor of just INT8 addition. I need to dive deeper into bonsai overall to see if it differs

https://arxiv.org/pdf/2402.17764
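The two views in this sub-thread can be put side by side in a small sketch. It assumes weights restricted to -1, 0, or +1 that share one scale factor per group; how Bonsai actually groups its scales, and whether its kernels accumulate first or dequantize to FP16 and multiply, is not something this thread establishes. Both function names are made up for illustration.

```python
def ternary_dot(weights, activations, scale):
    """Dot product with weights in {-1, 0, +1}, then one multiply by the shared scale."""
    acc = 0.0
    for w, x in zip(weights, activations):
        if w == 1:
            acc += x   # +1: add the activation
        elif w == -1:
            acc -= x   # -1: subtract the activation
        # 0: the activation is skipped
    return scale * acc  # the only multiplication, once per group

def dequantized_dot(weights, activations, scale):
    """Same result, computed by expanding each weight to scale * w first."""
    return sum((scale * w) * x for w, x in zip(weights, activations))

w = [1, 0, -1, 1]
x = [0.5, 2.0, 1.5, -1.0]
s = 0.25

print(ternary_dot(w, x, s), dequantized_dot(w, x, s))
```

Both paths give the same number. The difference is where the multiplications happen: once per group in the first version, once per weight in the second.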