HN Mirror


	by boredemployee 1202 days ago
	sorry for the extremely dumb question, but is it possible to run the 68B model on a computer with 8GB of RAM?

2 comments

infinityio 1202 days ago

In general, assume 2GB per billion parameters (16-bit weights). With quantisation you can get this down to <1GB (~500MB for 3 bit?), but even with that, the best you'll be able to run in 8GB is quantised llama-13B.

Having said that: if you are feeling incredibly patient you can technically run the 68B parameter model by swapping to disk, although it will not be a pleasant experience (think minutes or hours per token instead of tokens per second)

It's also worth noting that pure CPU inference is much slower than GPU/TPU inference, so the output will be much slower than a ChatGPT-like service even if the model does fit in your computer's RAM.
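
To make the rule of thumb above concrete, here is a minimal sketch of the arithmetic. It counts the weights only and ignores activations, context cache, and runtime overhead, so treat the results as lower bounds. The function name `weights_gb` and the list of bit widths are illustrative assumptions, not part of any library.

```python
def weights_gb(params_billion: float, bits_per_param: int) -> float:
    """Rough memory for the model weights alone, in GB (10^9 bytes).

    Assumption: every parameter is stored at the same bit width and
    nothing else (activations, cache, overhead) is counted.
    """
    bytes_per_param = bits_per_param / 8
    # billions of parameters * bytes per parameter = gigabytes
    return params_billion * bytes_per_param


# Compare the 68B model from the question with llama-13B
# at a few hypothetical bit widths.
for name, size in (("68B", 68), ("13B", 13)):
    for bits in (16, 8, 4, 3):
        print(f"{name} at {bits}-bit: ~{weights_gb(size, bits):.1f} GB")
```

By this arithmetic, 16-bit weights give the 2GB per billion figure, and 4-bit weights give 0.5GB per billion, which is why a quantised 13B model is about the limit for 8GB of RAM.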

boredemployee 1202 days ago

thanks for explaining! How much GPU memory would work well with 68B?

ukd1 1201 days ago

they said 2GB per billion parameters... and it's called 68B... I presume that's 68 billion... 68*2... so at least 136GB?

vishal0123 1201 days ago

68/2, not 68*2

boredemployee 1201 days ago

So, if I understand correctly, that's what you need to run the best model?