| Appreciate your response. We compared MKML mk600 (5.2GB) against llama.cpp Q5_1 (4.7GB) and Q6_K (5.1GB) on a 4090 for llama-7B. The test is the same in all cases: we generate 128 tokens from a single-token prompt (batch=1) and measure the performance of the forward pass during auto-regressive generation.
(llama-7B, single prompt, batch=1)
MKML mk600: 125 t/s
Llama.cpp Q5_1: 8̶4̶ 128 t/s
Llama.cpp Q6_K: 7̶8̶ 116 t/s
Our llama.cpp test:
Build (https://github.com/ggerganov/llama.cpp#cublas): make -j12 LLAMA_CUBLAS=1
Run: ./main -t 16 -ngl 3̶2̶ 35 -m llama-2-7b-chat.ggmlv3.q6_K.bin -p "?" -n 128
Please feel free to post your llama.cpp results if they are different.
>Maybe I'm misunderstanding MKML. As I understood it, MKML is a compression step that then feeds into another framework like HF's Transformers or PyTorch.
MKML is not a compression tool that feeds into another framework. It is an inference runtime (like FasterTransformer or vLLM), except that MKML is also plug and play with existing frameworks like Hugging Face. |
If you bothered to look at the llama.cpp output, you would see this line: llama_model_load_internal: offloaded 32/35 layers to GPU
The "-ngl 32" means that only 32 out of 35 layers are run on the GPU. This results in a huge slowdown, because the GPU has to sync with the CPU, which then computes the last 3 layers.
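A quick way to catch this before trusting a benchmark is to check the offload line in the llama.cpp output. The sketch below assumes the line has the same wording as the one quoted above; other llama.cpp versions may print it differently, and the helper names are made up for illustration.

```python
import re

# Pattern based on the log line quoted above; treat the exact wording as an
# assumption, since other llama.cpp versions may phrase it differently.
OFFLOAD_RE = re.compile(r"offloaded (\d+)/(\d+) layers to GPU")


def offload_status(log_text):
    """Return (offloaded, total) parsed from the log, or None if no such line exists."""
    match = OFFLOAD_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def fully_offloaded(log_text):
    """True only if the log reports every layer as offloaded to the GPU."""
    status = offload_status(log_text)
    return status is not None and status[0] == status[1]


sample = "llama_model_load_internal: offloaded 32/35 layers to GPU"
print(offload_status(sample))   # (32, 35)
print(fully_offloaded(sample))  # False
```

If `fully_offloaded` returns False, raise the `-ngl` value until the two numbers match before comparing tokens per second.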
On my 7900 XTX, I get a 55% speedup on llama.cpp (to 132.61 tok/sec) when running all layers on the GPU, rather than only 32 as in your measurements.
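For comparison, the same arithmetic can be applied to the struck-through and corrected 4090 figures in the parent comment. This is only a sketch of the percentage calculation. The function names are illustrative, and the implied baseline assumes "55% speedup" means the full-offload rate is 1.55 times the partial-offload rate.

```python
def speedup_percent(before_tps, after_tps):
    """Percentage increase in tokens/sec going from before_tps to after_tps."""
    return (after_tps / before_tps - 1.0) * 100.0


def implied_baseline(after_tps, speedup_pct):
    """Tokens/sec before a speedup, given the rate after it and the percentage gain."""
    return after_tps / (1.0 + speedup_pct / 100.0)


# 4090 figures from the parent comment: struck-through value, then corrected value.
print(f"Q5_1: {speedup_percent(84, 128):.1f}%")  # Q5_1: 52.4%
print(f"Q6_K: {speedup_percent(78, 116):.1f}%")  # Q6_K: 48.7%

# 7900 XTX: rate implied for the 32-layer run by a 55% speedup to 132.61 tok/sec.
print(f"implied baseline: {implied_baseline(132.61, 55):.2f} tok/sec")
```

Both cards show a gain of roughly 50% once every layer runs on the GPU.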