Its a artificial supply constraint due to artificial market segmentation enabled by Nvidia/AMD.
Honestly its crazy that AMD indulges in this, especially now. Their workstation market share is comparatively tiny, and instead they could have a swarm of devs (like me) pecking away at AMD compatibility on AI repos if they sold cheap 32GB/48GB cards.
Never said it was ok! Just saying that there are people willing to pay this much, so it costs this much. I'd very much like to buy a 40GB GPU for this to, but at these prices this is not happening - I'd have to turn it into a business to justify this expense, but I just don't feel like it.
People are also willing to die for all kinds of stupid reasons, and it's not indicative of _anything_ let alone a clever comment on the online forum. Show some decorum, please!
quantization makes it hard to have exactly one answer -- I'd make a q0 joke, except that's real now -- i.e. reduce the 3.4 * 10^38 range of float 32 to 2, a boolean.
it's not very good, at all, but now we can claim some pretty massive speedups.
I can't find anything for llama 2 70B on 4090 after 10 minutes of poking around, 13B is about 30 tkn/s. it looks like people generally don't run 70B unless they have multiple 4090s.