Hacker News
by brucethemoose2 929 days ago
It would be very tight: in 24GB, 8x7B (currently) has more overhead than a 70B.

It's theoretically doable with quantization from the recent 2-bit quant paper and a custom implementation (in exllamav2?).
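For a rough sense of why 2-bit quantization is what makes this plausible, the quantized weights alone take about (parameters × bits per weight / 8) bytes. This is a minimal sketch under stated assumptions: 8x7B is counted naively as 8 × 7B = 56B parameters, the bits-per-weight values are illustrative, and it ignores the context cache, activations and the runtime overhead mentioned above.

```python
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in decimal GB (1e9 bytes).

    Ignores context cache, activations and framework overhead.
    """
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: 8x7B counted naively as 8 * 7B parameters.
models = {"8x7B (nominal)": 8 * 7e9, "70B": 70e9}

for name, n in models.items():
    for bpw in (2.0, 2.5, 4.0):
        print(f"{name:>15} @ {bpw} bpw: {weight_gb(n, bpw):.1f} GB")
```

At 4 bpw neither fits in 24GB on weights alone (28 GB and 35 GB), while around 2 to 2.5 bpw both drop below it, leaving only a few GB for everything else.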

EDIT: Actually, the download is much smaller than 8x7B would suggest. Not sure how, but it's sized more like a 30B, perfect for a 3090. Very interesting.
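The "sized like a 30B, perfect for a 3090" point can also be checked in reverse: given a 24GB card, how many bits per weight can a model of a given size afford? A minimal sketch, assuming about 4 GB is held back for context cache, activations and framework overhead (a guess, not a measured figure) and using nominal parameter counts:

```python
def max_bpw(n_params: float, vram_gb: float, reserve_gb: float) -> float:
    """Highest bits per weight whose weights fit in vram_gb (decimal GB)
    after setting aside reserve_gb for cache, activations and overhead."""
    usable_bytes = (vram_gb - reserve_gb) * 1e9
    return usable_bytes * 8 / n_params


# Assumptions: 24 GB card (e.g. a 3090), ~4 GB reserved for non-weight memory.
for n in (30e9, 8 * 7e9, 70e9):
    print(f"{n / 1e9:.0f}B params -> up to {max_bpw(n, 24, 4):.2f} bpw")
```

Under these assumptions a 30B-sized model can use roughly 5.3 bpw, while the nominal 56B and 70B counts are pushed below 3 bpw, which is where the 2-bit methods come in.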