Yes, I run the 4-bit 70B on a 32-core Threadripper using llama.cpp. It uses around 37 GB of RAM and I get 4-5 tokens per second (slow but usable). Core usage is very uneven, with many cores at 0%, so maybe there's some more performance to be had in the future. Sometimes it gets stuck for a few seconds and then recovers.
It gives very detailed answers to coding questions and tasks just like GPT4 does (though I did not do a proper comparison).
The 13B uses 13 GB and gets 27 tokens per second; the 7B uses 0.5 GB and I get 39 tokens per second on this machine. Both produce interesting results, even for CUDA code generation, for example.
Try using fewer cores. RAM bandwidth is the real limiting factor there, so there is always some sweet spot between CPU core count and RAM bandwidth for an individual system.
For example, I use only 6 of the 10 cores on my M1 Pro laptop.
Thanks, -t 32 (instead of -t 13, which is what comes as default) makes a big difference in CPU usage across all cores. Not quite 100%, but all cores are above 50% with many at 100%. It only speeds up the eval t/s a tiny bit, to 3.3 (from 2.9).
You can see the memory required is 4820.60 MB (+ 256.00 MB per state). The process monitor (on Ubuntu) shows less than 400 MB.
This is the command:
./main -eps 1e-5 -m /media/z/models/TheBloke_Llama-2-7b-chat-GGML/llama-2-7b-chat.ggmlv3.q5_1.bin -t 13 -p \
"[INST] <<SYS>>You are a helpful and concise assistant<</SYS>>Write a c++ function that calculates RMSE between two double lists using CUDA. Don't explain, just write out the code.[/INST]"
Yeah that's using over 5 gigabytes, not 400 megabytes. Your process monitor is inaccurate; the memory used doesn't "count" because it's disk backed and the kernel is free to discard memory pages if it really needs the memory because it can always load it back from disk. But every time it does that you need to wait for the slow disk to read it back in again.
It is strange that it does that, given that there's plenty of free memory available in the system (it has 256 GB of RAM and wasn't running anything else).
Not really, it's just a question of accounting. mmap is functionally the same as disk cache. As long as you've got the RAM, it'll run from RAM. If you really want, you can force llama.cpp not to use mmap and explicitly load everything into RAM, but there's not really any performance reason to do that - if the kernel keeps dropping your pages, you're under memory pressure anyway and "locking" that memory will probably end up either thrashing or invoking the OOM killer.
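To make the accounting point concrete, here's a rough Python sketch of a read-only file mapping. It is only an illustration of how mmap behaves in general, not how llama.cpp is implemented, and the temp file is a made-up stand-in for a model file.

```python
import mmap
import os
import tempfile

# Hypothetical stand-in for a model file; a real one would be many gigabytes.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\x00" * 4096 + b"weights")
    path = f.name

with open(path, "rb") as f:
    # Map the whole file read-only. Nothing is copied into the process here;
    # pages are faulted in from the page cache (or disk) on first access.
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Touching this slice only pulls in the page(s) that contain it.
    tail = mapped[4096:4103]
    mapped.close()

os.remove(path)
print(tail)
```

Because the mapped pages are file-backed, the kernel can drop them and re-read them later, which is why a process monitor may not count them as the process's own memory. If you do want the explicit load, I believe the llama.cpp flags are `--no-mmap` and `--mlock`, but check `./main --help` on your build.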
./main -m /media/z/models/TheBloke_Llama-2-70B-Chat-GGML/llama-2-70b-chat.ggmlv3.q4_0.bin -gqa 8 -t 13 -p "The prompt..."
And this is the report at the end of the answer:
llama_print_timings: load time = 999.84 ms
llama_print_timings: sample time = 302.21 ms / 703 runs ( 0.43 ms per token, 2326.20 tokens per second)
llama_print_timings: prompt eval time = 69377.40 ms / 300 tokens ( 231.26 ms per token, 4.32 tokens per second)
llama_print_timings: eval time = 236017.69 ms / 701 runs ( 336.69 ms per token, 2.97 tokens per second)
llama_print_timings: total time = 305815.51 ms
The computer is a Ryzen Threadripper PRO 5975WX (32 cores, 64 threads) with 256 GB of RAM. It also has a GPU, but I checked with nvtop that nothing is being loaded to it.
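If you want to compare runs (say, different -t values), it helps to pull the rates out of that report instead of eyeballing them. A minimal sketch, assuming the `llama_print_timings` line format shown above; the `parse_timings` helper is my own invention, not part of llama.cpp, and the format may differ in other versions.

```python
import re

# Matches timing lines that carry a count, such as:
# llama_print_timings: eval time = 236017.69 ms / 701 runs ( ... )
TIMING_RE = re.compile(
    r"llama_print_timings:\s*(?P<name>[a-z ]+?)\s+time\s*=\s*(?P<ms>[\d.]+)\s+ms"
    r"\s*/\s*(?P<count>\d+)\s+(?:runs|tokens)"
)

def parse_timings(log_text):
    """Return {stage name: tokens per second}, recomputed from ms and count."""
    rates = {}
    for m in TIMING_RE.finditer(log_text):
        seconds = float(m.group("ms")) / 1000.0
        rates[m.group("name")] = int(m.group("count")) / seconds
    return rates

report = """\
llama_print_timings: load time = 999.84 ms
llama_print_timings: sample time = 302.21 ms / 703 runs ( 0.43 ms per token, 2326.20 tokens per second)
llama_print_timings: prompt eval time = 69377.40 ms / 300 tokens ( 231.26 ms per token, 4.32 tokens per second)
llama_print_timings: eval time = 236017.69 ms / 701 runs ( 336.69 ms per token, 2.97 tokens per second)
llama_print_timings: total time = 305815.51 ms
"""

rates = parse_timings(report)
# Lines without a count (load, total) are skipped.
print(f"eval: {rates['eval']:.2f} tokens/s")
```

The "eval" entry is the generation speed; "prompt eval" is how fast the prompt was ingested, which is a different number and usually higher.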
For more standardized speed benchmarking, I'd recommend benchmarking with something like `-c 128 -n 1920 --ignore-eos` (and skipping the `-p` entirely). The number that you care most about would be the "eval time" tokens/second. It tends to get slower as context increases, which is why it's sort of important to standardize. I think 2-3 t/s is about what's expected (Threadripper Pro 5000 w/ 8 channels of DDR4-3200 should have a theoretical peak memory bandwidth of 204.8 GB/s, and memory bandwidth is the main limiting factor for most systems for LLM inferencing).
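The back-of-the-envelope math behind that, as a sketch. The big assumption (mine, and a simplification) is that generating one token streams every weight from RAM once, so the ceiling is bandwidth divided by model size. The 37 GB figure is the RAM usage reported upthread for the 4-bit 70B; the function names are just for illustration.

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak DRAM bandwidth; each channel is 64 bits (8 bytes) wide."""
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9

def max_tokens_per_second(bandwidth_gb_s, model_size_gb):
    """Upper bound, assuming all weights are read once per generated token."""
    return bandwidth_gb_s / model_size_gb

# 8 channels of DDR4-3200, as on Threadripper Pro 5000.
bw = peak_bandwidth_gb_s(channels=8, mega_transfers_per_s=3200)

# Roughly 37 GB of weights for the 4-bit 70B (figure from this thread).
ceiling = max_tokens_per_second(bw, model_size_gb=37)

print(f"peak bandwidth: {bw:.1f} GB/s, ceiling: {ceiling:.1f} tokens/s")
```

That puts the ceiling at about 5.5 t/s, so a measured 2-3 t/s is roughly half of theoretical peak, which is plausible given that real workloads rarely hit peak bandwidth.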