by lolinder
1045 days ago
> A good example is that llama.cpp is a fantastic framework to run models locally for the single-user case (batch=1). While llama.cpp supports different backends (RPi, CPU, GPU), I don't think it would be particularly fair to compare and show that MKML is better at a given perplexity, compression ratio, and speed on GPU for a multi-user case (batch >> 1), when that is not llama.cpp's targeted use case (afaik).

Maybe I'm misunderstanding MKML. As I understood it, MKML is a compression step that then feeds into another framework like HF's Transformers or PyTorch. If that's the case, then comparing MKML to llama.cpp is apples to oranges; the correct comparison would be to GGML and the various quantization methods. The inference engine and its intended use cases aren't what's in question here. If a model compressed with MKML outperforms a standard quantized model in a batch setting, that's useful information! It would not be at all unfair for you to cite that as a strength, and it would increase your credibility because you wouldn't seem to be dodging the question of how you compare to your substitutes.
We compared MKML mk600 (5.2GB) against llama.cpp Q5_1 (4.7GB) and Q6_K (5.1GB) on a 4090 for llama-7B. The test is the same in all cases: we generate 128 tokens from a single-token prompt (batch=1) and measure the throughput of the forward pass during autoregressive decoding.
(llama-7B, single prompt, batch=1)
MKML mk600: 125 t/s
llama.cpp Q5_1: 128 t/s (corrected from 84 t/s)
llama.cpp Q6_K: 116 t/s (corrected from 78 t/s)
Our llama.cpp test:
Build (https://github.com/ggerganov/llama.cpp#cublas):
make -j12 LLAMA_CUBLAS=1
Run (-ngl corrected from 32 to 35):
./main -t 16 -ngl 35 -m llama-2-7b-chat.ggmlv3.q6_K.bin -p "?" -n 128
Please feel free to post your llama.cpp results if they are different.
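For anyone reproducing these numbers with their own runtime, here is a minimal sketch of how a tokens-per-second figure for this kind of test (one prompt token in, 128 tokens generated, batch=1) could be measured. It is not the harness used for the results above. The names `measure_tokens_per_second`, `step`, and `dummy_step` are hypothetical, and `step` stands in for whatever single-token forward pass your runtime exposes.

```python
import time
from typing import Callable


def measure_tokens_per_second(
    step: Callable[[int], int],
    first_token: int,
    n_tokens: int = 128,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Time n_tokens autoregressive steps and return tokens per second.

    `step` is a hypothetical stand-in for one forward pass: it takes the
    previous token id and returns the next one. On a GPU backend it must
    block until the token is actually available, otherwise the timer only
    measures how fast work was queued.
    """
    token = first_token
    start = clock()
    for _ in range(n_tokens):
        # Each new token is fed back in as the next input (batch=1).
        token = step(token)
    elapsed = clock() - start
    return n_tokens / elapsed


def dummy_step(token: int) -> int:
    # Placeholder "model" so the sketch runs without any runtime installed.
    return (token + 1) % 32000


if __name__ == "__main__":
    rate = measure_tokens_per_second(dummy_step, first_token=1)
    print(f"{rate:.0f} t/s")
```

The clock is passed in as a parameter only so the arithmetic can be checked with a fake timer; in a real run the default `time.perf_counter` is what you want.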
> Maybe I'm misunderstanding MKML. As I understood it, MKML is a compression step that then feeds into another framework like HF's Transformers or PyTorch.
MKML is not a compression tool that feeds into another framework. It is an inference runtime (like FasterTransformer or vLLM), except that MKML is also plug-and-play with existing frameworks like Hugging Face.