by infinityio
1202 days ago
In general, assume 2GB per billion parameters at 16-bit precision. With quantisation you can get this down to <1GB (~500MB at 4-bit, a little less at 3-bit), but even with that you'll only be able to run quantised llama-13B in the best case.

Having said that: if you are feeling incredibly patient you can technically run the 65B parameter model by swapping to disk, although it will not be a pleasant experience (think minutes or hours per token instead of tokens per second).

Also worth noting: pure CPU inference is much slower than GPU/TPU inference, so the output will be much slower than a ChatGPT-like service even if the model does fit in your computer's RAM.
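If you want to sanity-check the rule of thumb yourself, here's a rough sketch of the arithmetic. The function names and the 16 GB figure are made up for illustration, and it only counts the weights: real usage is higher once you add the context cache, quantisation metadata and whatever else is running.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    # params_billion * 1e9 weights * (bits / 8) bytes each, divided by 1e9
    return params_billion * bits_per_weight / 8


def fits_in_ram(params_billion: float, bits_per_weight: float, ram_gb: float) -> bool:
    """Optimistic check: True if the weights alone are smaller than ram_gb."""
    return weights_gb(params_billion, bits_per_weight) < ram_gb


if __name__ == "__main__":
    ram_gb = 16  # hypothetical machine, not taken from the thread
    for size in (7, 13, 65):
        for bits in (16, 4, 3):
            gb = weights_gb(size, bits)
            verdict = "fits" if fits_in_ram(size, bits, ram_gb) else "too big"
            print(f"{size}B @ {bits}-bit: ~{gb:.1f} GB ({verdict} in {ram_gb} GB)")
```

At 16 bits this gives 2 GB per billion parameters, matching the estimate above; anything reported as "too big" is the territory where you'd be swapping to disk.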