| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by danielhanchen 854 days ago
	I don't disagree - fair point there definitely is a latency transfer overhead. I would suspect one had prefetch it by calling `.to("cuda", non_blocking = True)` say 2 layers ahead, so you can in theory hide the movement. I think somewhere the blog did mention HQQ for 1 bit is slower for now, maybe due to the transfer overhead, although I couldn't exactly remember where

1 comments

Dylan16807 854 days ago

My point is more that if it's that many bytes flowing around on demand, you're basically swapping layers in and out of the GPU as you use them (or x% of each layer).

Which is fine, and it's a valid feature, but you don't need to split those bytes into "data" and "metadata" to make that happen.

Is there actually something they gain from this particular method of splitting?

link

danielhanchen 854 days ago

I guess it's mainly to reduce VRAM usage. Assuming we don't do this, then a 7b model with 1bit group size 8 will use 3GB or something of VRAM, whilst a 4bit group size of 64 will use 4GB approx.

Assume we have a 100b model - with 4bit, VRAM is around 57GB or so. With 1bit, 43GB VRAM, but by moving the scalars and zero point to RAM, VRAM use is like 15GB or something, at the cost of like 28GB RAM usage.

I guess maybe a valid approach is to dynamically select which ones to move to RAM or VRAM, given your VRAM budget. Say you have a 40GB GPU, clearly move more of the scalars to GPU. But if you have a weak GPU, then you need to use more RAM.

link

Dylan16807 853 days ago

I still don't think I'm getting my point across.

What if you store 40% of the data and 40% of the metadata in CPU memory. Is that the same for performance?

Is there a reason we want particular parts of the layer to stay on the GPU? Or is it just number of bytes.

link

danielhanchen 853 days ago

Oh, very good question - tbh im not sure. Another close technique is layer offloading - if your network can't fit and has layers 1, 2, ..., 32, we offload layers 16 to 32 to RAM, then load them in to GPU memory on the fly.

I'm gonna guess the performance hit is similar - although I have not tried it myself to verify for benchmarking

link