by infinityio 1202 days ago
In general, assume 2GB per billion parameters (16-bit weights). With quantisation you can get this down to under 1GB (~500MB at 4-bit, ~375MB at 3-bit), but even with that you'll only be able to run quantised llama-13B in the best case.
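
As a rough sketch of the arithmetic behind that rule of thumb: weight memory is approximately parameter count times bits per weight, divided by 8. The function name `weight_memory_gb` is made up for illustration, and the estimate covers weights only (no activations, KV cache, or runtime overhead), so treat the numbers as a floor rather than a requirement.

```python
# Rule-of-thumb weight memory: parameters * bits per weight / 8 bytes.
# Assumption: weights only; activations, KV cache and runtime overhead are ignored.

def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB for a model with `params_billion` billion parameters."""
    # One billion parameters at 8 bits each is 1 GB, so the billions cancel out.
    return params_billion * bits_per_weight / 8


for bits in (16, 8, 4, 3):
    print(f"{bits:>2}-bit: {weight_memory_gb(1, bits):.3f} GB per billion parameters")

# The 13B model mentioned above, quantised to 4 bits per weight.
print(f"13B at 4-bit: {weight_memory_gb(13, 4):.1f} GB")
```

At 16 bits this reproduces the 2GB-per-billion figure, and at 4 bits it gives the roughly 500MB-per-billion figure.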

Having said that: if you are feeling incredibly patient you can technically run the 68B parameter model by swapping to disk, although it will not be a pleasant experience (think minutes or hours per token instead of tokens per second)
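
To see where "minutes per token" comes from, here is a back-of-the-envelope sketch. It assumes every generated token needs all the weights read from disk once, which is a pessimistic simplification for a model that does not fit in RAM. The read speeds are hypothetical round numbers, not measurements, and `seconds_per_token` is an illustrative name.

```python
# Assumption: each token requires streaming the full set of weights from disk once,
# and disk read bandwidth is the only bottleneck.

def seconds_per_token(model_gb: float, read_gb_per_s: float) -> float:
    """Time to read `model_gb` of weights at a sustained `read_gb_per_s`."""
    return model_gb / read_gb_per_s


model_gb = 68 * 2  # 68B parameters at 2 GB per billion (unquantised)

# Hypothetical sustained read speeds in GB/s (illustrative only).
for name, speed in (("hard disk", 0.15), ("SATA SSD", 0.5), ("NVMe SSD", 3.0)):
    minutes = seconds_per_token(model_gb, speed) / 60
    print(f"{name}: ~{minutes:.1f} min per token")
```

Real numbers depend on how much of the model stays cached in RAM and on the access pattern, so this is only an order-of-magnitude guide.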

It's also worth noting that pure CPU inference is much slower than GPU/TPU inference, so output will be much slower than a ChatGPT-like service even if the model does fit in your computer's RAM.


Thanks for explaining! How much GPU memory would work well with 68B?
They said 2GB per billion parameters... and it's called 68B... I presume that's 68 billion... 68*2... so at least 136GB?
68/2, not 68*2
So, if I understand correctly, this is what you need to run the best model?

With GPU:

VRAM + RAM >= 68/2

Without GPU:

RAM >= 68/2

Not sure about the "=" part. You'd want some memory for the compositor and other OS graphics, and regular RAM for OS and programs, no?
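
A minimal sketch of that check with some headroom held back for the OS and graphics. The reserved amounts (4GB of RAM, 1GB of VRAM) are placeholder guesses rather than recommendations, and `fits` is a made-up helper name. It only accounts for the weights, using the 68/2 figure from above.

```python
def fits(model_gb: float, ram_gb: float, vram_gb: float = 0.0,
         reserved_ram_gb: float = 4.0, reserved_vram_gb: float = 1.0) -> bool:
    """True if the weights fit in the memory left after the reserved amounts.

    Assumption: reserved_ram_gb / reserved_vram_gb are rough placeholders for
    what the OS, programs and compositor need; adjust them for your machine.
    """
    usable_ram = max(ram_gb - reserved_ram_gb, 0.0)
    usable_vram = max(vram_gb - reserved_vram_gb, 0.0)
    return usable_ram + usable_vram >= model_gb


model_gb = 68 / 2  # 68B parameters at ~0.5 GB per billion (4-bit quantised)

print(fits(model_gb, ram_gb=32))              # False: 28 GB usable < 34 GB
print(fits(model_gb, ram_gb=32, vram_gb=24))  # True: 28 + 23 = 51 GB usable
```

With the reserves set to zero this reduces to the plain `VRAM + RAM >= 68/2` condition.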