Hacker News new | ask | show | jobs
by cfn 1046 days ago
Yes, I run the 4bit, 70B on a threadripper 32 core using llama.cpp. It uses around 37Gb of RAM and I get 4-5 tokens per second (slow but usable). Core usage is very uneven with many cores at 0% so maybe there's some more performance to be had in the future. Sometimes it gets stuck for a few seconds and then recovers.

It gives very detailed answers to coding questions and tasks just like GPT4 does (though I did not do a proper comparison).

The 13b uses 13Gb with 27 tokens per second the 7b uses 0.5Gb and I get 39 tokens per second on this machine.Both produce interesting results even for CUDA code generation, for example.

3 comments

Try to use less cores. RAM bandwidth is real limiting factor there, so there always some sweet spot between CPU cores and RAM bandwidth for individual system.

For example, I use only 6 cores from 10 on my M1 Pro laptop.

That is an interesting idea. Can you tell me what is the switch for number of cores?
-t 32

Use a maximum of the number of physical cores and then scale down.

Thanks, -t 32 (instead of -t 13 which is what comes as default) makes a big difference in CPU usage across all cores. Not quite 100% but all cores are above 50% with many at 100%. It speeds up just a tiny bit the eval t/s to 3.3 (from 2.9).
How does the 7B model use only 512 megabytes? That's not possible?

Is it using mmap and concealing the actual memory usage?

I forgot to say I am using ggml models. This is what llama.cpp outputs when you start it:

main: build = 942 (4f6b60c) main: seed = 1691400051 llama.cpp: loading model from /media/z/models/TheBloke_Llama-2-7b-chat-GGML/llama-2-7b-chat.ggmlv3.q5_1.bin llama_model_load_internal: format = ggjt v3 (latest) llama_model_load_internal: n_vocab = 32000 llama_model_load_internal: n_ctx = 512 llama_model_load_internal: n_embd = 4096 llama_model_load_internal: n_mult = 256 llama_model_load_internal: n_head = 32 llama_model_load_internal: n_head_kv = 32 llama_model_load_internal: n_layer = 32 llama_model_load_internal: n_rot = 128 llama_model_load_internal: n_gqa = 1 llama_model_load_internal: rnorm_eps = 1.0e-05 llama_model_load_internal: n_ff = 11008 llama_model_load_internal: freq_base = 10000.0 llama_model_load_internal: freq_scale = 1 llama_model_load_internal: ftype = 9 (mostly Q5_1) llama_model_load_internal: model size = 7B llama_model_load_internal: ggml ctx size = 0.08 MB llama_model_load_internal: mem required = 4820.60 MB (+ 256.00 MB per state) llama_new_context_with_model: kv self size = 256.00 MB llama_new_context_with_model: compute buffer total size = 71.84 MB

You can see the memory required at 4820.60 MB (+ 256.00 MB per state). The process monitor (on Ubuntu) shows less than 400 Mb.

This is the command: ./main -eps 1e-5 -m /media/z/models/TheBloke_Llama-2-7b-chat-GGML/llama-2-7b-chat.ggmlv3.q5_1.bin -t 13 -p \ "[INST] <<SYS>>You are a helpful and concise assistant<</SYS>>Write a c++ function that calculates RMSE between two double lists using CUDA. Don't explain, just write out the code.[/INST]"

Yeah that's using over 5 gigabytes, not 400 megabytes. Your process monitor is inaccurate; the memory used doesn't "count" because it's disk backed and the kernel is free to discard memory pages if it really needs the memory because it can always load it back from disk. But every time it does that you need to wait for the slow disk to read it back in again.
It is strange that it does that given that there's plenty of free memory available in the system (it has 256Gb of RAM and wasn't running anything else).
Not really, it's just a question of accounting. mmap is functionally the same as disk cache. As long as you've got the RAM, it'll run from RAM. If you really want, you can force llama.cpp not to use mmap and explicitly load everything into RAM, but there's not really any performance reason to do that - if the kernel keeps dropping your pages, you're under memory pressure anyway and "locking" that memory will probably end up either thrashing or invoking the OOM killer.
5 tokens/s on 70B 4bit seems really high for your setup.
This is the command:

./main -m /media/z/models/TheBloke_Llama-2-70B-Chat-GGML/llama-2-70b-chat.ggmlv3.q4_0.bin -gqa 8 -t 13 -p "The prompt..."

And this is the report at the end of the answer:

llama_print_timings: load time = 999.84 ms

llama_print_timings: sample time = 302.21 ms / 703 runs ( 0.43 ms per token, 2326.20 tokens per second)

llama_print_timings: prompt eval time = 69377.40 ms / 300 tokens ( 231.26 ms per token, 4.32 tokens per second)

llama_print_timings: eval time = 236017.69 ms / 701 runs ( 336.69 ms per token, 2.97 tokens per second)

llama_print_timings: total time = 305815.51 ms

The computer is a Ryzen threadripper pro 5975wx 32-cores × 64 with 256Gb of RAM. It also has a GPU but I checked with nvtop that nothing is being loaded to it.

For more standardized speed benchmarking, I'd recommend benchmarking with something like `-c 128 -n 1920 --ignore-eos` (and skipping the `-p` entirely). The number that you care most about would be the "eval time" tokens/second - it tends to get slower as context increases, which is why it's sort of important to standardize. I think 2-3 t/s is about what's expected (Threadripper Pro 5000 w/ 8 channels of DDR-3200 should have an expected theoretical top memory bandwidth of 204.8 GB/s - memory bandwidth is the main limiting factor for most systems for LLM inferencing).