| HN Mirror


by jlokier 929 days ago

> - 96GB of weights. You won't be able to run this on your home GPU.

You can these days, even in a portable device running on battery.

96GB fits comfortably in some laptop GPUs released this year.

2 comments

refulgentis 929 days ago

This is extremely misleading. Source: I've been working on local LLMs for the past 10 months. Got my Mac laptop too. I'm bullish too. But we shouldn't breezily dismiss those concerns out of hand. In practice, it's single-digit tokens a second on a $4500 laptop for a model with weights half this size (Llama 2 70B Q2 GGUF => 29 GB, Q8 => 36 GB)

MacsHeadroom 929 days ago

Mixtral 8x7b only needs 12B of weights in RAM per generation.

2B for the attention head and 5B from each of 2 experts.

It should be able to run slightly faster than a 13B dense model, in as little as 16GB of RAM with room to spare.
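To make the arithmetic in this comment explicit, here is a rough sketch using only the parameter counts quoted above (2B shared, 5B per expert, 8 experts, 2 active per token). These are the commenter's round numbers, not official figures, and the bit widths are illustrative assumptions.

```python
# Parameter counts as stated in the comment above (round numbers, not official).
SHARED_PARAMS = 2e9        # attention / shared weights
PARAMS_PER_EXPERT = 5e9
NUM_EXPERTS = 8
EXPERTS_PER_TOKEN = 2


def active_params():
    """Weights touched while generating a single token."""
    return SHARED_PARAMS + EXPERTS_PER_TOKEN * PARAMS_PER_EXPERT


def total_params():
    """Weights needed if every expert stays resident."""
    return SHARED_PARAMS + NUM_EXPERTS * PARAMS_PER_EXPERT


def weight_gb(params, bits_per_weight):
    """Storage for the weights alone, in GB (no KV cache or runtime overhead)."""
    return params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for bits in (16, 8, 4):  # assumed precisions, for illustration only
        print(
            f"{bits:>2}-bit: active {weight_gb(active_params(), bits):.1f} GB, "
            f"all experts {weight_gb(total_params(), bits):.1f} GB"
        )
```

The gap between the "active" and "all experts" columns is what the replies below argue about: the active set is small, but which experts are active can change from token to token.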

filterfiber 929 days ago

> in as little as 16GB of RAM with room to spare.

I don't think that's the case, for full speed you still need (5B*8)/2+2~fewB overhead.

I think the experts are chosen per token? That means that yes, you technically only need two in VRAM (plus router/overhead) per token, but you'll have to constantly be loading in different experts unless you can fit them all, which would still be terrible for performance.

So you'll still be PCIe/RAM speed limited unless you can fit all of the experts into memory (or get really lucky and only need two experts).

dkarras 929 days ago

No, it doesn't work that way. Experts can change per token, so for interactive speeds you need them all in memory unless you want to wait for model swaps between tokens.
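A back-of-the-envelope sketch of the swapping cost described in the two replies above. It assumes uniformly random routing (real routers are not uniform), 4-bit weights, and a 32 GB/s link (roughly the theoretical peak of PCIe 4.0 x16). All three are assumptions for illustration; the 5B-per-expert figure comes from the thread.

```python
from math import comb

NUM_EXPERTS = 8
EXPERTS_PER_TOKEN = 2
PARAMS_PER_EXPERT = 5e9   # figure quoted in the thread
BYTES_PER_WEIGHT = 0.5    # assumption: 4-bit quantisation
LINK_GB_PER_S = 32.0      # assumption: ~PCIe 4.0 x16 theoretical peak


def prob_same_experts(n=NUM_EXPERTS, k=EXPERTS_PER_TOKEN):
    """Chance that a uniform pick of k experts equals the k already resident."""
    return 1 / comb(n, k)


def expected_misses(n=NUM_EXPERTS, k=EXPERTS_PER_TOKEN):
    """Expected number of chosen experts that are not among the k resident ones."""
    return k * (n - k) / n


def swap_seconds_per_token():
    """Average time spent copying missing expert weights for one token."""
    gb_to_load = expected_misses() * PARAMS_PER_EXPERT * BYTES_PER_WEIGHT / 1e9
    return gb_to_load / LINK_GB_PER_S


if __name__ == "__main__":
    print(f"P(no swap needed)  : {prob_same_experts():.3f}")
    print(f"expected misses    : {expected_misses():.2f} experts/token")
    print(f"swap time per token: {swap_seconds_per_token() * 1000:.0f} ms")
```

Under these assumptions the transfer time alone caps generation at single-digit tokens per second, which is consistent with the point that all experts need to be resident for interactive speeds.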

coolspot 929 days ago

> $4500

Which is more than the price of an RTX A6000 48GB ($4k used on eBay)

brucethemoose2 929 days ago

Which is outrageously priced, in case that's not clear. It's a 2020 RTX 3090 with doubled-up memory ICs, which is not much extra BoM.

baq 929 days ago

Clearly it’s worth what people are willing to pay for it. At least it isn’t being used to compute hashes of virtual gold.

brucethemoose2 929 days ago

It's an artificial supply constraint due to artificial market segmentation enabled by Nvidia/AMD.

Honestly it's crazy that AMD indulges in this, especially now. Their workstation market share is comparatively tiny, and instead they could have a swarm of devs (like me) pecking away at AMD compatibility on AI repos if they sold cheap 32GB/48GB cards.

baq 929 days ago

Never said it was ok! Just saying that there are people willing to pay this much, so it costs this much. I'd very much like to buy a 40GB GPU for this too, but at these prices this is not happening - I'd have to turn it into a business to justify this expense, but I just don't feel like it.

tucnak 929 days ago

People are also willing to die for all kinds of stupid reasons, and it's not indicative of _anything_, let alone a clever comment on an online forum. Show some decorum, please!

CamperBob2 929 days ago

How fast does it run on that?

refulgentis 929 days ago

quantization makes it hard to have exactly one answer -- I'd make a q0 joke, except that's real now -- i.e. reduce the 3.4 * 10^38 range of float 32 to 2, a boolean.

it's not very good, at all, but now we can claim some pretty massive speedups.

I can't find anything for Llama 2 70B on a 4090 after 10 minutes of poking around; 13B is about 30 tokens/s. It looks like people generally don't run 70B unless they have multiple 4090s.
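For readers unfamiliar with the 1-bit idea mentioned at the top of this comment, a minimal sketch: keep only the sign of each weight plus one shared scale. This is a generic sign-and-scale binarisation written for illustration, not the scheme of any particular library or model format.

```python
def binarize(weights):
    """Reduce each weight to one bit (its sign) plus a single shared scale."""
    scale = sum(abs(w) for w in weights) / len(weights)  # mean magnitude
    bits = [w >= 0 for w in weights]                     # True means positive
    return bits, scale


def dequantize(bits, scale):
    """Rebuild approximate weights: every value becomes +scale or -scale."""
    return [scale if b else -scale for b in bits]


if __name__ == "__main__":
    weights = [0.5, -1.5, 2.0, -0.25]  # made-up example values
    bits, scale = binarize(weights)
    print(bits, scale)
    print(dequantize(bits, scale))
```

Every weight collapses to one of two values, which is why quality drops so sharply even though the memory footprint and bandwidth needs shrink.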

michaelt 929 days ago

Be a lot cooler if you said what laptop, and how much quantisation you're assuming :)

tvararu 929 days ago

They're probably referring to the new MacBook Pros with up to 128GB of unified memory.

jlokier 929 days ago

Sibling commenter tvararu is correct. 2023 Apple MacBook with 128GiB RAM, all available to the GPU. No quantisation required :)

Other sibling commenter refulgentis is correct too. The Apple M{1-3} Max chips have up to 400GB/s memory bandwidth. I think that's noticeably faster than every other consumer CPU out there. But it's slower than a top Nvidia GPU. If the entire 96GB model has to be read by the GPU for each token, that will limit unquantised performance to about 4 tokens/s at best (400GB/s divided by 96GB). However, as the "Mixtral" model under discussion is a mixture-of-experts, it doesn't have to read the whole model for each token, so it might go faster. Perhaps still single-digit tokens/s though, for unquantised.
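The bandwidth bound above can be written out as a short sketch. The 400 GB/s and 96 GB figures are from this comment; the 12/42 fraction reuses the round parameter counts from elsewhere in the thread and is only a hypothetical. These are ceilings that ignore compute, KV cache reads and other overhead, so measured speeds will be lower.

```python
def max_tokens_per_second(bandwidth_gb_s, model_gb, fraction_read=1.0):
    """Upper bound on decode speed when each token must stream
    `fraction_read` of the weights from memory once."""
    return bandwidth_gb_s / (model_gb * fraction_read)


BANDWIDTH_GB_S = 400.0  # M-series Max memory bandwidth quoted above
MODEL_GB = 96.0         # unquantised weights, as stated in the thread

# Dense case: every weight is read for every token.
dense = max_tokens_per_second(BANDWIDTH_GB_S, MODEL_GB)

# Hypothetical MoE case: only 12B of 42B weights touched per token
# (round numbers taken from another comment, not official figures).
moe = max_tokens_per_second(BANDWIDTH_GB_S, MODEL_GB, fraction_read=12 / 42)

if __name__ == "__main__":
    print(f"dense ceiling: {dense:.1f} tokens/s")
    print(f"MoE ceiling  : {moe:.1f} tokens/s")
```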