| HN Mirror



	by andersa 702 days ago
	You can't fine-tune a 70B model with this. It barely even fits the weights before it runs out of VRAM. You need a bigger machine.
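A back-of-the-envelope sketch of the VRAM claim above. Only the 70B size and the 6-GPU count come from this thread. The 24 GB per GPU figure and the roughly 16 bytes per parameter rule of thumb for mixed-precision Adam fine-tuning are assumptions, and activations are ignored entirely.

```python
PARAMS = 70e9            # 70B parameters, from the comment above
GPU_COUNT = 6            # GPU count discussed in the thread
VRAM_PER_GPU_GB = 24     # ASSUMPTION: not stated anywhere in the thread


def weights_gb(params, bytes_per_param=2):
    """Memory for the weights alone, 2 bytes per parameter for fp16/bf16."""
    return params * bytes_per_param / 1e9


def full_finetune_gb(params, bytes_per_param=16):
    """Rough full fine-tuning footprint, excluding activations.

    ASSUMPTION: ~16 bytes per parameter for mixed-precision Adam
    (fp16 weights and gradients plus fp32 master weights and two
    optimizer moments). This is a rule of thumb, not a measurement.
    """
    return params * bytes_per_param / 1e9


total_vram_gb = GPU_COUNT * VRAM_PER_GPU_GB

print(f"total VRAM:        {total_vram_gb} GB")
print(f"fp16 weights:      {weights_gb(PARAMS):.0f} GB")
print(f"full fine-tuning:  {full_finetune_gb(PARAMS):.0f} GB")
```

Under these assumptions the fp16 weights just about fit, while a full fine-tune needs several times the available memory, which matches the point being made.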


StrangeDoctor 702 days ago

I think the idea is that you network them together if you need more, and most models can be split nicely.


anon389r58r58 702 days ago

For that you'd probably be better off removing one of the GPUs, and replacing it with a networking card.

The form factor problem will remain. The tinybox takes up 15U for compute that you'd normally expect to find in a 4U form factor.


andersa 702 days ago

I don't think they're intended for rack usage like that. More like for people to put under their desks... there would be no reason to build the giant case with fancy silent-ish cooling if you're going to put them next to your other jet engines.


anon389r58r58 702 days ago

Fully agree, and I think the tinybox is great if you put only one of them somewhere in your local office.

I just don't think it makes sense to connect multiple of them into a "cluster" to work with bigger models, as the networking bandwidth isn't good enough and you'd have to fit multiple of these big boxes into your local space. Then I might as well put up a rack in a separate room.


StrangeDoctor 702 days ago

There's an OCP 3.0 mezzanine, so there's no need to remove a card, and you'd get 200 Gbps, unless I've missed something about needing to remove a card to access it. But yeah, stacking or racking these seems less than ideal.


omikun 702 days ago

3kW under your desk... no need to turn on the heat in the winter!


andersa 702 days ago

Most models actually can't be split nicely by 6. There's a reason NVIDIA builds nodes with 4 and 8 GPUs.


StrangeDoctor 702 days ago

I don't see why 6 is inherently worse than 4 or 8; not all of the layers are exactly equal, nor is their count a power of 2. Compared with 2^2 or 2^3, 2^1*3^1 might give you more options.

The main issue I run into is FLOPS vs. RAM in any given card/model.
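A minimal sketch of the layer-wise (pipeline-style) split this comment has in mind, where whole layers are handed to each GPU and the counts do not need to divide evenly. The 80-layer figure and the `split_layers` helper are hypothetical, chosen only for illustration.

```python
def split_layers(num_layers, num_gpus):
    """Spread whole layers across GPUs as evenly as possible.

    The first `remainder` GPUs take one extra layer each, so the
    per-GPU counts differ by at most one.
    """
    base, remainder = divmod(num_layers, num_gpus)
    return [base + 1 if i < remainder else base for i in range(num_gpus)]


# ASSUMPTION: an 80-layer model, used purely as an example.
for gpus in (4, 6, 8):
    print(gpus, split_layers(80, gpus))
```

With this kind of split, 6 GPUs only cost a slight imbalance between stages; it says nothing about splitting the work inside a single layer.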


andersa 702 days ago

Usually you want to split each layer to run with tensor parallelism, which works optimally if you can assign each KV head to a specific GPU. All currently popular models have a power-of-2 number of KV heads.
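The divisibility argument can be sketched in a few lines. This is only an illustration of the arithmetic, not how any particular framework assigns heads; the value of 8 KV heads is an assumed example of a power-of-2 count, and both helper names are hypothetical.

```python
def kv_heads_per_gpu(num_kv_heads, num_gpus):
    """Return KV heads per GPU, or None if they can't be assigned evenly."""
    if num_kv_heads % num_gpus != 0:
        return None
    return num_kv_heads // num_gpus


def even_gpu_counts(num_kv_heads, max_gpus=8):
    """List the GPU counts (1..max_gpus) that divide the KV heads evenly."""
    return [n for n in range(1, max_gpus + 1) if num_kv_heads % n == 0]


NUM_KV_HEADS = 8  # ASSUMPTION: illustrative power-of-2 head count

print(even_gpu_counts(NUM_KV_HEADS))      # GPU counts that split evenly
print(kv_heads_per_gpu(NUM_KV_HEADS, 4))  # heads per GPU on a 4-GPU node
print(kv_heads_per_gpu(NUM_KV_HEADS, 6))  # None: 8 heads don't divide by 6
```

With a power-of-2 head count, only power-of-2 GPU counts get whole heads each, which is why 6 is awkward for this style of parallelism even though it works for a layer-wise split.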


StrangeDoctor 702 days ago

Interesting, thank you for the pointers.
