by cfn
1041 days ago
I forgot to say I am using ggml models. This is what llama.cpp outputs when you start it:

main: build = 942 (4f6b60c)
main: seed = 1691400051
llama.cpp: loading model from /media/z/models/TheBloke_Llama-2-7b-chat-GGML/llama-2-7b-chat.ggmlv3.q5_1.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_head_kv = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 1.0e-05
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: mem required = 4820.60 MB (+ 256.00 MB per state)
llama_new_context_with_model: kv self size = 256.00 MB
llama_new_context_with_model: compute buffer total size = 71.84 MB

You can see the memory required is 4820.60 MB (+ 256.00 MB per state), yet the process monitor (on Ubuntu) shows less than 400 MB. This is the command:

./main -eps 1e-5 -m /media/z/models/TheBloke_Llama-2-7b-chat-GGML/llama-2-7b-chat.ggmlv3.q5_1.bin -t 13 -p \
"[INST] <<SYS>>You are a helpful and concise assistant<</SYS>>Write a c++ function that calculates RMSE between two double lists using CUDA. Don't explain, just write out the code.[/INST]"
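
For anyone wanting to check these numbers, here is a rough Python sketch of the accounting. It rests on two assumptions that are not in the log: that the K and V caches are stored as 16-bit floats, and that the model file is memory-mapped (which I believe is llama.cpp's default unless you pass --no-mmap). If the second holds, the weights would show up as file-backed pages, which desktop process monitors often leave out of their "Memory" column, and that might account for the gap. The helper names are made up for illustration.

```python
# Rough memory accounting for the llama.cpp log above.
# The three sizes are copied from the log; everything else is an assumption.
N_LAYER = 32    # n_layer
N_CTX = 512     # n_ctx
N_EMBD = 4096   # n_embd (n_gqa = 1, so K and V use the full embedding width)
BYTES_F16 = 2   # assumption: the KV cache holds 16-bit floats


def kv_cache_mib(n_layer, n_ctx, n_embd, bytes_per_elem=BYTES_F16):
    """K and V caches in MiB: two tensors per layer, n_ctx * n_embd elements each."""
    return 2 * n_layer * n_ctx * n_embd * bytes_per_elem / (1024 * 1024)


def rss_breakdown(pid="self"):
    """Read resident memory in kB from /proc/<pid>/status (Linux only).

    RssFile counts file-backed pages (e.g. an mmap'd model file),
    RssAnon counts ordinary heap/stack pages.
    """
    wanted = ("VmRSS", "RssAnon", "RssFile", "RssShmem")
    fields = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            key, _, value = line.partition(":")
            if key in wanted:
                fields[key] = int(value.split()[0])
    return fields


if __name__ == "__main__":
    print(f"kv cache: {kv_cache_mib(N_LAYER, N_CTX, N_EMBD):.2f} MiB")
    # Pass the pid of the running ./main process instead of "self" to inspect it.
    print(rss_breakdown())
```

Under the f16 assumption the first function reproduces the "kv self size = 256.00 MB" line exactly. Calling rss_breakdown with the pid of the running ./main should show whether most of the resident memory sits under RssFile rather than RssAnon.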