

by infinityio 1202 days ago

In general, assume 2GB per billion parameters (16-bit weights). With quantisation you can get this down to <1GB (~500MB at 4-bit, a little less at 3-bit), but even then you'll only be able to run quantised llama-13B in the best case.

Having said that, if you are feeling incredibly patient you can technically run the 68B parameter model by swapping to disk, although it will not be a pleasant experience (think minutes or hours per token instead of tokens per second).

It's also worth noting that pure CPU inference is much slower than GPU/TPU inference, so the output will be much slower than a ChatGPT-like service even if the model does fit in your computer's RAM.
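
To make the rule of thumb concrete, here is a rough sketch of the arithmetic in Python. It only counts the weights themselves; activations, the KV cache, and runtime overhead are ignored, so treat the numbers as a lower bound. The function name `estimate_weights_gb` is made up for illustration and is not part of any library.

```python
def estimate_weights_gb(params_billions: float, bits_per_weight: int = 16) -> float:
    """Approximate memory for the model weights only, in GB (1 GB = 1e9 bytes).

    Assumption: every parameter is stored at the same bit width and
    nothing else (activations, KV cache, overhead) is counted.
    """
    bytes_per_weight = bits_per_weight / 8
    # (params_billions * 1e9 params) * bytes_per_weight / (1e9 bytes per GB)
    return params_billions * bytes_per_weight


if __name__ == "__main__":
    for size in (13, 68):
        for bits in (16, 4):
            gb = estimate_weights_gb(size, bits)
            print(f"{size}B at {bits}-bit: ~{gb:.1f} GB")
```

Under these assumptions a 13B model needs about 26 GB at 16-bit and about 6.5 GB at 4-bit, while a 68B model needs about 136 GB at 16-bit and about 34 GB at 4-bit.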


boredemployee 1202 days ago

Thanks for explaining! How much GPU memory would work nicely with 68B?


ukd1 1202 days ago

They said 2GB per 1 billion... and it's called 68B... I presume that's 68 billion... 68*2... so at least 136GB?


vishal0123 1201 days ago

68/2, not 68*2


boredemployee 1201 days ago

So, if I understand correctly, that's what you need to run the best model?