by boredemployee 1202 days ago
Sorry for the extremely dumb question, but is it possible to run the 68B model on a computer with 8GB of RAM?

In general, assume 2GB per billion parameters (16-bit weights). With quantisation you can get this down to <1GB (~500MB at 4 bit), but even with that you'll only be able to run quantised llama-13B in the best case.
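The rule of thumb above is just bits per weight divided by 8, times the parameter count. A minimal sketch of that arithmetic, assuming 1 GB = 1e9 bytes and counting the weights only (no activations or context cache); the function name is made up for illustration:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters * bytes per parameter = GB
    return params_billion * bytes_per_weight


if __name__ == "__main__":
    for bits in (16, 8, 4, 3):
        per_billion = weight_memory_gb(1, bits)
        print(f"{bits:>2}-bit: {per_billion:.3f} GB per billion parameters")
```

On these numbers a 4-bit 13B model needs about 6.5GB for weights, which is why it is roughly the ceiling for an 8GB machine.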

Having said that: if you are feeling incredibly patient you can technically run the 68B parameter model by swapping to disk, although it will not be a pleasant experience (think minutes or hours per token instead of tokens per second)
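To get a feel for why swapping is so slow, here is a back-of-the-envelope sketch. It assumes roughly the whole set of weights has to be re-read from disk for every token, and the disk speeds are hypothetical round numbers, not measurements:

```python
def seconds_per_token(model_gb: float, disk_gb_per_s: float) -> float:
    """Rough time to stream the full set of weights from disk once."""
    if disk_gb_per_s <= 0:
        raise ValueError("disk_gb_per_s must be positive")
    return model_gb / disk_gb_per_s


if __name__ == "__main__":
    # Hypothetical sequential read speeds in GB/s (assumptions for illustration).
    disks = {"hard drive": 0.1, "SATA SSD": 0.5, "NVMe SSD": 2.0}
    model_gb = 34.0  # 68B at roughly 0.5 GB per billion parameters
    for name, speed in disks.items():
        print(f"{name}: ~{seconds_per_token(model_gb, speed):.0f} s per token")
```

Real swap behaviour is usually worse than sequential reads, so treat these as optimistic lower bounds.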

Also worth noting: pure CPU inference is much slower than GPU/TPU inference, so the output will be much slower than a ChatGPT-like service even if the model does fit in your computer's RAM.

Thanks for explaining! How much GPU memory would work well with 68B?
They said 2GB per 1 billion... and it's called 68B... I presume that's 68 billion... 68*2... so at least 136GB?
68/2, not 68*2
So, if I understand correctly, that's what you need to run the best model?

With GPU:

VRAM + RAM >= 68/2

Without GPU:

RAM >= 68/2

Not sure about the "=" part. You'd want some memory for the compositor and other OS graphics, and regular RAM for OS and programs, no?
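The inequalities above, plus some slack for the OS, can be written as a quick check. This is only a sketch: the 2GB headroom default is an arbitrary assumption, and the function name and parameters are made up for illustration.

```python
def fits(params_billion: float, bits_per_weight: float,
         ram_gb: float, vram_gb: float = 0.0,
         headroom_gb: float = 2.0) -> bool:
    """True if the weights plus some headroom fit in VRAM + RAM combined."""
    needed_gb = params_billion * bits_per_weight / 8  # weights only
    # headroom_gb is a guess at what the OS and other programs need
    return needed_gb + headroom_gb <= ram_gb + vram_gb


if __name__ == "__main__":
    print(fits(68, 4, ram_gb=8))               # 68B at 4 bit, 8GB RAM, no GPU
    print(fits(68, 4, ram_gb=24, vram_gb=24))  # 24GB RAM plus a 24GB GPU
    print(fits(7, 4, ram_gb=8))                # 7B at 4 bit, 8GB RAM
```

With these assumptions the first case fails and the other two pass; change `headroom_gb` to match what your system actually uses when idle.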
You can't; it needs around 40GB of RAM.

Technically you can by swapping to disk but it would be too slow to be usable.

What you can do, however, is run the 7B model with 4-bit quantization within 8GB of RAM.