by cypress66 1048 days ago
5 tokens/s on 70B 4bit seems really high for your setup.

This is the command:

./main -m /media/z/models/TheBloke_Llama-2-70B-Chat-GGML/llama-2-70b-chat.ggmlv3.q4_0.bin -gqa 8 -t 13 -p "The prompt..."

And this is the report at the end of the answer:

llama_print_timings: load time = 999.84 ms

llama_print_timings: sample time = 302.21 ms / 703 runs ( 0.43 ms per token, 2326.20 tokens per second)

llama_print_timings: prompt eval time = 69377.40 ms / 300 tokens ( 231.26 ms per token, 4.32 tokens per second)

llama_print_timings: eval time = 236017.69 ms / 701 runs ( 336.69 ms per token, 2.97 tokens per second)

llama_print_timings: total time = 305815.51 ms
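For anyone who wants to double-check the tokens/second figures in a report like this, here is a rough Python sketch that recomputes them from the raw milliseconds and counts. The regex and the `parse_timings` helper are my own illustration, not part of llama.cpp, and they assume the line format shown above.

```python
import re

# Report lines copied from the output above.
REPORT = """\
llama_print_timings: load time = 999.84 ms
llama_print_timings: sample time = 302.21 ms / 703 runs ( 0.43 ms per token, 2326.20 tokens per second)
llama_print_timings: prompt eval time = 69377.40 ms / 300 tokens ( 231.26 ms per token, 4.32 tokens per second)
llama_print_timings: eval time = 236017.69 ms / 701 runs ( 336.69 ms per token, 2.97 tokens per second)
llama_print_timings: total time = 305815.51 ms
"""

# Hypothetical pattern: stage name, elapsed ms, and an optional run/token count.
LINE_RE = re.compile(
    r"llama_print_timings:\s*(.+?) time\s*=\s*([\d.]+) ms"
    r"(?:\s*/\s*(\d+)\s+(?:runs|tokens))?"
)

def parse_timings(report):
    """Return {stage: tokens_per_second} for every line that has a count."""
    rates = {}
    for line in report.splitlines():
        match = LINE_RE.match(line)
        if match is None or match.group(3) is None:
            continue  # "load" and "total" lines carry no token count
        stage, ms, count = match.group(1), float(match.group(2)), int(match.group(3))
        rates[stage] = count / (ms / 1000.0)
    return rates

rates = parse_timings(REPORT)
for stage, tps in rates.items():
    print(f"{stage}: {tps:.2f} tokens/s")
```

The "eval" entry is the generation speed, which is the figure usually quoted when comparing setups.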

The computer is a Ryzen Threadripper Pro 5975WX (32 cores, 64 threads) with 256 GB of RAM. It also has a GPU, but I checked with nvtop that nothing is being loaded onto it.

For more standardized speed benchmarking, I'd recommend running with something like `-c 128 -n 1920 --ignore-eos` (and skipping the `-p` entirely). The number you care most about is the "eval time" tokens/second. It tends to get slower as context grows, which is why it's important to standardize. I think 2-3 t/s is about what's expected: a Threadripper Pro 5000 with 8 channels of DDR4-3200 has a theoretical peak memory bandwidth of 204.8 GB/s, and memory bandwidth is the main limiting factor for LLM inference on most systems.
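To make the bandwidth argument concrete, here is a back-of-the-envelope sketch in Python. It assumes each generated token has to stream the whole set of weights from RAM once, and it assumes the 70B q4_0 file is roughly 39 GB; that size is my approximation, not a number from the report above. The function names are just for illustration.

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak DRAM bandwidth in GB/s (64-bit channel = 8 bytes per transfer)."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000

def tokens_per_s_ceiling(bandwidth_gb_s, model_size_gb):
    """Upper bound on generation speed if every token reads all weights once."""
    return bandwidth_gb_s / model_size_gb

# 8 channels of DDR4-3200, as in the comment above.
bandwidth = peak_bandwidth_gb_s(channels=8, mega_transfers_per_s=3200)

# ASSUMPTION: approximate size of the 70B q4_0 weights, not taken from the report.
MODEL_SIZE_GB = 39.0

ceiling = tokens_per_s_ceiling(bandwidth, MODEL_SIZE_GB)
observed = 2.97  # "eval time" tokens per second from the report

print(f"peak bandwidth: {bandwidth:.1f} GB/s")
print(f"ceiling:        {ceiling:.2f} tokens/s")
print(f"observed:       {observed:.2f} tokens/s ({observed / ceiling:.0%} of ceiling)")
```

Under those assumptions the ceiling is a bit over 5 tokens/s, and real systems rarely sustain the theoretical peak, so landing in the 2-3 t/s range looks plausible rather than a sign of misconfiguration.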