This is the command:

./main -m /media/z/models/TheBloke_Llama-2-70B-Chat-GGML/llama-2-70b-chat.ggmlv3.q4_0.bin -gqa 8 -t 13 -p "The prompt..."

And this is the report at the end of the answer:

llama_print_timings: load time = 999.84 ms

llama_print_timings: sample time = 302.21 ms / 703 runs ( 0.43 ms per token, 2326.20 tokens per second)

llama_print_timings: prompt eval time = 69377.40 ms / 300 tokens ( 231.26 ms per token, 4.32 tokens per second)

llama_print_timings: eval time = 236017.69 ms / 701 runs ( 336.69 ms per token, 2.97 tokens per second)

llama_print_timings: total time = 305815.51 ms
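
As a sanity check, the per-token figures in the report can be recomputed from the totals: ms per token is the elapsed time divided by the count, and tokens per second is the count divided by the elapsed seconds. Below is a minimal Python sketch of that arithmetic. The `parse_timings` helper and its regular expression are my own illustration, not part of llama.cpp, and they assume the log lines look exactly like the ones pasted above.

```python
import re

# The report exactly as pasted above.
REPORT = """\
llama_print_timings: load time = 999.84 ms
llama_print_timings: sample time = 302.21 ms / 703 runs ( 0.43 ms per token, 2326.20 tokens per second)
llama_print_timings: prompt eval time = 69377.40 ms / 300 tokens ( 231.26 ms per token, 4.32 tokens per second)
llama_print_timings: eval time = 236017.69 ms / 701 runs ( 336.69 ms per token, 2.97 tokens per second)
llama_print_timings: total time = 305815.51 ms
"""

# Hypothetical pattern: matches only lines that carry a "<ms> ms / <count> runs|tokens" pair,
# so the "load time" and "total time" lines are skipped.
LINE = re.compile(
    r"llama_print_timings:\s+(?P<stage>[a-z ]+?) time\s+=\s+"
    r"(?P<ms>[\d.]+) ms\s+/\s+(?P<n>\d+) (?:runs|tokens)"
)


def parse_timings(report):
    """Return {stage: {"ms_per_token": ..., "tokens_per_second": ...}}."""
    stats = {}
    for m in LINE.finditer(report):
        ms, n = float(m["ms"]), int(m["n"])
        stats[m["stage"]] = {
            "ms_per_token": ms / n,
            "tokens_per_second": n / (ms / 1000.0),
        }
    return stats


if __name__ == "__main__":
    for stage, s in parse_timings(REPORT).items():
        print(f"{stage}: {s['ms_per_token']:.2f} ms/token, "
              f"{s['tokens_per_second']:.2f} tokens/s")
```

The recomputed values agree with what the report prints, so generation really is running at about 3 tokens per second on this setup.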

The computer is a Ryzen Threadripper PRO 5975WX (32 cores, 64 threads) with 256 GB of RAM. It also has a GPU, but I checked with nvtop that nothing is being loaded onto it.