./main -m /media/z/models/TheBloke_Llama-2-70B-Chat-GGML/llama-2-70b-chat.ggmlv3.q4_0.bin -gqa 8 -t 13 -p "The prompt..."
And this is the report at the end of the answer:
llama_print_timings: load time = 999.84 ms
llama_print_timings: sample time = 302.21 ms / 703 runs ( 0.43 ms per token, 2326.20 tokens per second)
llama_print_timings: prompt eval time = 69377.40 ms / 300 tokens ( 231.26 ms per token, 4.32 tokens per second)
llama_print_timings: eval time = 236017.69 ms / 701 runs ( 336.69 ms per token, 2.97 tokens per second)
llama_print_timings: total time = 305815.51 ms
The computer is a Ryzen Threadripper PRO 5975WX (32 cores, 64 threads) with 256 GB of RAM. It also has a GPU, but I checked with nvtop that nothing is being loaded onto it.
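As a quick sanity check on the report above, the tokens-per-second figures can be recomputed from the raw millisecond totals. This is only a sketch: `parse_timing` and the regular expression are hypothetical helpers written for this post, not part of llama.cpp, and they assume the `llama_print_timings` line format shown above.

```python
import re

# Hypothetical helper, assuming the log format shown in the report above.
TIMING_RE = re.compile(
    r"llama_print_timings:\s*(?P<name>[a-z ]+?)\s*=\s*(?P<ms>[\d.]+) ms"
    r"(?:\s*/\s*(?P<count>\d+) (?:runs|tokens))?"
)

def parse_timing(line):
    """Return (name, total_ms, tokens_per_second or None) for one timing line."""
    m = TIMING_RE.match(line)
    if m is None:
        return None
    total_ms = float(m.group("ms"))
    count = m.group("count")
    # Lines such as "load time" carry no token count, so no rate is computed.
    tps = int(count) / (total_ms / 1000.0) if count else None
    return m.group("name"), total_ms, tps

eval_line = (
    "llama_print_timings: eval time = 236017.69 ms / 701 runs "
    "( 336.69 ms per token, 2.97 tokens per second)"
)
name, total_ms, tps = parse_timing(eval_line)
print(f"{name}: {tps:.2f} tokens/s")
```

The recomputed generation rate agrees with the 2.97 tokens per second that the log prints for "eval time".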
For more standardized speed benchmarking, I'd recommend running with something like `-c 128 -n 1920 --ignore-eos` (and skipping the `-p` entirely). The number you care most about is the "eval time" tokens/second. It tends to get slower as context grows, which is why it's important to standardize. I think 2-3 t/s is about what's expected: a Threadripper PRO 5000 with 8 channels of DDR4-3200 has a theoretical peak memory bandwidth of 204.8 GB/s, and memory bandwidth is the main limiting factor for LLM inference on most systems.
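To show where the 204.8 GB/s figure comes from and what it implies, here is a rough back-of-the-envelope sketch. It assumes each DDR4 channel is 64 bits (8 bytes) wide, that generating one token reads roughly the whole set of weights from RAM once, and that the 70B q4_0 file is about 39 GB. That model size is my approximation, not a number from this thread, and the function names are just illustrative.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_sec, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s (8 bytes = one 64-bit channel)."""
    return channels * mega_transfers_per_sec * 1e6 * bytes_per_transfer / 1e9

def max_tokens_per_sec(bandwidth_gbs, model_size_gb):
    """Upper bound on generation speed if every token streams all weights once."""
    return bandwidth_gbs / model_size_gb

# 8 channels of DDR4-3200, as described above.
bandwidth = peak_bandwidth_gbs(channels=8, mega_transfers_per_sec=3200)

# ASSUMPTION: approximate on-disk size of a 70B q4_0 model; adjust for your file.
MODEL_SIZE_GB = 39.0

ceiling = max_tokens_per_sec(bandwidth, MODEL_SIZE_GB)
print(f"peak bandwidth: {bandwidth:.1f} GB/s")
print(f"bandwidth-bound ceiling: {ceiling:.1f} tokens/s")
```

Under those assumptions the ceiling comes out at around 5 t/s, so a measured 2.97 t/s is a bit over half of the theoretical limit, which is plausible since real systems rarely sustain peak bandwidth.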