by cfn 1041 days ago
I forgot to say I am using ggml models. This is what llama.cpp outputs when you start it:

    main: build = 942 (4f6b60c)
    main: seed = 1691400051
    llama.cpp: loading model from /media/z/models/TheBloke_Llama-2-7b-chat-GGML/llama-2-7b-chat.ggmlv3.q5_1.bin
    llama_model_load_internal: format = ggjt v3 (latest)
    llama_model_load_internal: n_vocab = 32000
    llama_model_load_internal: n_ctx = 512
    llama_model_load_internal: n_embd = 4096
    llama_model_load_internal: n_mult = 256
    llama_model_load_internal: n_head = 32
    llama_model_load_internal: n_head_kv = 32
    llama_model_load_internal: n_layer = 32
    llama_model_load_internal: n_rot = 128
    llama_model_load_internal: n_gqa = 1
    llama_model_load_internal: rnorm_eps = 1.0e-05
    llama_model_load_internal: n_ff = 11008
    llama_model_load_internal: freq_base = 10000.0
    llama_model_load_internal: freq_scale = 1
    llama_model_load_internal: ftype = 9 (mostly Q5_1)
    llama_model_load_internal: model size = 7B
    llama_model_load_internal: ggml ctx size = 0.08 MB
    llama_model_load_internal: mem required = 4820.60 MB (+ 256.00 MB per state)
    llama_new_context_with_model: kv self size = 256.00 MB
    llama_new_context_with_model: compute buffer total size = 71.84 MB

The log reports mem required = 4820.60 MB (+ 256.00 MB per state), yet the process monitor (on Ubuntu) shows less than 400 MB.
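For anyone wondering where the 256.00 MB per state comes from, here is a rough sanity check in Python using the n_layer, n_ctx and n_embd values from the log. It assumes the key/value cache stores 16-bit floats (2 bytes per value) and that "MB" in the log means 2^20 bytes. Neither is stated in the output above, and kv_cache_bytes is just an illustrative helper, not a llama.cpp function.

```python
MIB = 1024 * 1024

def kv_cache_bytes(n_layer, n_ctx, n_embd, bytes_per_value=2):
    """Estimate key/value cache size (hypothetical helper, not part of llama.cpp).

    Assumes one key vector and one value vector of n_embd elements
    per layer per context position, each element 2 bytes (f16).
    """
    return 2 * n_layer * n_ctx * n_embd * bytes_per_value

# Values copied from the log: n_layer = 32, n_ctx = 512, n_embd = 4096.
kv_mib = kv_cache_bytes(n_layer=32, n_ctx=512, n_embd=4096) / MIB

# "mem required" from the log plus one state's worth of cache.
total_mib = 4820.60 + kv_mib

print(f"kv cache estimate: {kv_mib:.2f} MiB")
print(f"mem required + one state: {total_mib:.2f} MiB")
```

Under those assumptions the estimate lands on the same figure as the "kv self size" line, and the total comes to roughly 5 GB, far more than the 400 MB the process monitor shows.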

This is the command:

    ./main -eps 1e-5 -m /media/z/models/TheBloke_Llama-2-7b-chat-GGML/llama-2-7b-chat.ggmlv3.q5_1.bin -t 13 -p \
    "[INST] <<SYS>>You are a helpful and concise assistant<</SYS>>Write a c++ function that calculates RMSE between two double lists using CUDA. Don't explain, just write out the code.[/INST]"


Yeah, that's using over 5 gigabytes, not 400 megabytes. Your process monitor is misleading here: the memory doesn't "count" because it's disk-backed, so the kernel is free to discard those pages when it really needs the memory and load them back from disk later. But every time it does that, you have to wait for the slow disk to read them in again.
It is strange that it does that, given that there's plenty of free memory available in the system (it has 256 GB of RAM and wasn't running anything else).
Not really, it's just a question of accounting. mmap is functionally the same as disk cache: as long as you've got the RAM, it'll run from RAM. If you really want, you can force llama.cpp not to use mmap (the --no-mmap option) and explicitly load everything into RAM, but there's no real performance reason to do that. If the kernel keeps dropping your pages, you're under memory pressure anyway, and "locking" that memory will probably end up either thrashing or invoking the OOM killer.
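The accounting point can be seen in a small Python experiment. This is only a sketch, assuming Linux with /proc/self/statm available. The scratch file, the 16 MiB size and the helper names are made up for illustration and have nothing to do with how llama.cpp loads models. It maps a file read-only, reads one byte per page, and reports the process's resident and shared page counts before and after.

```python
import mmap
import os
import tempfile

def resident_and_shared_kib():
    """Return (resident, shared) in KiB from /proc/self/statm, or None if unavailable."""
    try:
        with open("/proc/self/statm") as f:
            fields = f.read().split()
        page_kib = os.sysconf("SC_PAGE_SIZE") // 1024
        # statm fields are in pages: size, resident, shared, ...
        return int(fields[1]) * page_kib, int(fields[2]) * page_kib
    except (OSError, IndexError, ValueError, AttributeError):
        return None

def touch_mapped_file(size=16 * 1024 * 1024):
    """Map a scratch file read-only and read one byte from every page."""
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"\0" * size)
        tmp.flush()
        with mmap.mmap(tmp.fileno(), size, access=mmap.ACCESS_READ) as mm:
            before = resident_and_shared_kib()
            # Reading a byte faults the page into this process's address space.
            checksum = sum(mm[i] for i in range(0, size, mmap.PAGESIZE))
            after = resident_and_shared_kib()
    return before, after, checksum

if __name__ == "__main__":
    before, after, _ = touch_mapped_file()
    print("before (resident, shared) KiB:", before)
    print("after  (resident, shared) KiB:", after)
```

On a typical Linux box the touched pages should show up mostly in the "shared" figure, because they are backed by the file rather than by anonymous memory. A monitor that reports something like resident minus shared would barely register them, which would be consistent with the low number above. Whether the Ubuntu process monitor computes its memory column that way is a guess on my part.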