by cfn
1046 days ago
Yes, I run the 4-bit 70B on a 32-core Threadripper using llama.cpp. It uses around 37 GB of RAM and I get 4-5 tokens per second (slow but usable). Core usage is very uneven, with many cores at 0%, so maybe there's some more performance to be had in the future. Sometimes it gets stuck for a few seconds and then recovers. It gives very detailed answers to coding questions and tasks, just like GPT-4 does (though I did not do a proper comparison). The 13B uses 13 GB and gets 27 tokens per second; the 7B uses 0.5 GB and I get 39 tokens per second on this machine. Both produce interesting results, even for CUDA code generation, for example.
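As a rough sanity check on the RAM figure for the 70B model, here is a back-of-the-envelope sketch in Python. The function name `estimate_weight_gb` is hypothetical, and the calculation assumes exactly 4 bits per weight and round parameter counts. It ignores the KV cache, runtime overhead, and the extra per-block scale data that real quantization formats store, so actual usage will differ somewhat.

```python
def estimate_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes).

    Assumption: every parameter is stored at `bits_per_weight` bits.
    Real quantized files carry extra metadata, so treat this as a lower bound.
    """
    return n_params * bits_per_weight / 8 / 1e9


# Nominal parameter counts (assumed round numbers, not exact model sizes).
for name, n_params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    print(f"{name}: ~{estimate_weight_gb(n_params, 4):.1f} GB at 4 bits per weight")
```

For the 70B model this gives about 35 GB for the weights alone, which is in the same ballpark as the roughly 37 GB I see in practice.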
|
For example, I use only 6 of the 10 cores on my M1 Pro laptop.