by paul_mk1 1045 days ago
Hi, one of the founders here.

Attempting to address some of the comments in a single message.

To explain why we decided not to compare against existing methods: I think it would be difficult to do so fairly, since there are many tradeoffs and different use cases. It's not always the case that one technique is bad and the other is good; it's more about the targeted design point (say, cloud vs. local). We are openly offering our numbers / benchmarks and looking for early partners that are aligned with our current value proposition (hence the closed beta).

A good example is that llama.cpp is a fantastic framework to run models locally for the single-user case (batch=1). While llama.cpp supports different backends (RPi, CPU, GPU), I don't think it would be particularly fair to compare and show that MKML is better at a given perplexity, compression ratio, and speed on GPU for a multi-user case (batch >> 1), when that is not llama.cpp’s targeted use case (afaik). For example MKML achieves ~2700 tok/sec at batch 32 (i.e. 32 prompts in parallel) on a 4090 for a Llama-2 7B, with a ~4̶.̶2̶G̶B 5.2GB memory footprint, and perplexity that is ~fp16.
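As a rough sanity check on the batch figure above, the aggregate number can be converted into a per-prompt rate. This is only arithmetic on the quoted ~2700 tok/sec at batch 32; the function name is made up for illustration, and it assumes throughput is split evenly across the 32 prompts.

```python
def per_stream_tokens_per_sec(aggregate_tok_per_sec: float, batch_size: int) -> float:
    """Average decode rate seen by one prompt when `batch_size` prompts share the GPU.

    Assumes the aggregate throughput is divided evenly across prompts.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return aggregate_tok_per_sec / batch_size


# Figures quoted in the comment: ~2700 tok/sec aggregate at batch 32.
print(per_stream_tokens_per_sec(2700, 32))  # 84.375
```

So each of the 32 prompts would see roughly 84 tok/sec, under the even-split assumption.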

Also, we're not currently wrapping any open source tools or techniques for quantization. Everything is our own and there’s more news to come soon.

If anyone has specific technical questions I'd be happy to answer as best I can.

Cheers, Paul Merolla


> A good example is that llama.cpp is a fantastic framework to run models locally for the single-user case (batch=1). While llama.cpp supports different backends (RPi, CPU, GPU), I don't think it would be particularly fair to compare and show that MKML is better at a given perplexity, compression ratio, and speed on GPU for a multi-user case (batch >> 1), when that is not llama.cpp’s targeted use case (afaik).

Maybe I'm misunderstanding MKML. As I understood it, MKML is a compression step that then feeds into another framework like HF's Transformers or PyTorch. If that's the case, then comparing MKML to llama.cpp is apples to oranges—the correct comparison would be to GGML and the various quantization methods. The inference engine and its intended use cases aren't what's in question here.

If a model compressed with MKML outperforms a standard quantized model in a batch setting, that's useful information! It would not be at all unfair for you to cite that as a strength, and it would increase your credibility because you wouldn't seem to be dodging the question of how you compare to your substitutes.

Appreciate your response.

We compared MKML mk600 (5.2GB) against llama.cpp Q5_1 (4.7GB) and Q6_k (5.1GB) on a 4090 for llama-7B. The test is the same in all cases: we generate 128 tokens from a single token prompt (batch=1) and measure performance of the forward pass during auto-regression.

(llama-7B, single prompt, batch=1)

MKML mk600: 125t/s

Llama.cpp Q5_1: 8̶4̶ 128 t/s

Llama.cpp Q6_k: 7̶8̶ 116 t/s

Our llama.cpp test: Build (https://github.com/ggerganov/llama.cpp#cublas):

make -j12 LLAMA_CUBLAS=1

Run (edited: originally -ngl 32, now -ngl 35):

./main -t 16 -ngl 35 -m llama-2-7b-chat.ggmlv3.q6_K.bin -p "?" -n 128

Please feel free to post your llama.cpp results if they are different.

>Maybe I'm misunderstanding MKML. As I understood it, MKML is a compression step that then feeds into another framework like HF's Transformers or PyTorch.

MKML is not a compression tool that feeds into another framework. It is an inference runtime (like FasterTransformer or vLLM), except that MKML is also plug and play with existing frameworks like Hugging Face.

OK, so this is a case of bad measurement and comparison.

If you bothered to look at the llama.cpp output, you would see this line: llama_model_load_internal: offloaded 32/35 layers to GPU

The "-ngl 32" means that only 32 out of 35 layers are being run on the GPU, and this results in a huge slow down as the GPU syncs with the CPU, and then computes the last 3 layers on the CPU.

On my XTX 7900, I get a 55% speed up on llama.cpp (to 132.61 tok/sec) when running all layers on the GPU, rather than only 32 as in your measurements.

>The "-ngl 32" means that only 32 out of 35 layers are being run on the GPU, and this results in a huge slow down as the GPU syncs with the CPU, and then computes the last 3 layers on the CPU.

Thanks for the updated run configuration. It was a misunderstanding on our part about what llama.cpp considers “layers”, since layers are traditionally understood as learned-parameter decoder layers (as they are in Hugging Face models). In that sense, the llama 7B model has 32 layers.

>On my XTX 7900, I get a 55% speed up on llama.cpp (to 132.61 tok/sec) when running all layers on the GPU, rather than only 32 as in your measurements.

On my 4090 I now get 128 t/s for Q5_1, and 116 t/s for Q6_k. So these are in the same ballpark as mk600's 125 t/s for batch=1. It's not surprising that different inference runtimes approach the same speed, since they become memory bound at similar model sizes.
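To make the memory-bound point concrete, here is a back-of-the-envelope sketch. It assumes each generated token requires reading all the weights from GPU memory once, ignores KV-cache traffic and kernel overhead, and uses roughly 1008 GB/s as the 4090's spec-sheet memory bandwidth (an assumption, not a measured figure). The model sizes and measured speeds are the ones quoted in this thread; the function name is hypothetical.

```python
def decode_ceiling_tok_per_sec(model_gb: float, bandwidth_gb_per_sec: float) -> float:
    """Upper bound on batch=1 decode speed if every token reads all weights once."""
    return bandwidth_gb_per_sec / model_gb


# Assumption: RTX 4090 peak memory bandwidth from the spec sheet, in GB/s.
ASSUMED_4090_BANDWIDTH_GB_PER_SEC = 1008.0

# (name, model size in GB, measured t/s), all taken from the thread.
runs = [
    ("MKML mk600", 5.2, 125),
    ("llama.cpp Q5_1", 4.7, 128),
    ("llama.cpp Q6_k", 5.1, 116),
]

for name, size_gb, measured in runs:
    ceiling = decode_ceiling_tok_per_sec(size_gb, ASSUMED_4090_BANDWIDTH_GB_PER_SEC)
    print(f"{name}: ceiling ~{ceiling:.0f} t/s, measured {measured} t/s "
          f"({measured / ceiling:.0%} of ceiling)")
```

Under these assumptions all three ceilings land within about 10% of each other, which is consistent with the measured speeds clustering together.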

Something doesn't smell right.

Such sloppy errors with measurement and comparison (from people who are supposedly experts?), and cageyness about answering technical questions, remind me of the era of cryptocurrency scams.

Agreed. The obvious absence of an apples-to-apples comparison here is quite telling.

> [...] llama.cpp is a fantastic framework to run models locally for the single-user case (batch=1)

> [...] I don't think it would be particularly fair to compare and show that MKML is better at a given perplexity, compression ratio, and speed on GPU for a multi-user case (batch >> 1)

Ok so you agree that llama.cpp etc are great for batch==1, right?

And I agree their targeted use case is not batch==32 (because who is doing that really?)

But if we extended llama.cpp or some other faster batch==1 implementation to support batch==32, why do you suppose it wouldn't still be faster than MKML? It seems to me that if you can do batch==1 faster, you could easily do batch>>1 faster too -- it is just that no one really needed that (yet?)

> If anyone has specific technical questions I'd be happy to answer as best I can.

What is the context size for these measurements? Is it the full 4k for llama-2? And just to be clear, when you say memory footprint, this is the entire memory footprint, right? Weights, 4k KV cache, etc.?
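For a sense of why the KV cache matters here, a minimal sketch of its size for Llama-2 7B. It assumes the standard architecture (32 layers, 32 KV heads, head dimension 128) and an uncompressed fp16 cache; how MKML actually stores its cache is not stated in this thread, so treat the numbers as a baseline, not as MKML's footprint.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, batch_size: int,
                   bytes_per_elem: int = 2) -> int:
    """Size of a dense KV cache: one K and one V vector per layer, head and token."""
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * batch_size * bytes_per_elem)


GIB = 1024 ** 3

# Assumed Llama-2 7B shape: 32 layers, 32 KV heads, head_dim 128, fp16 cache.
one_seq = kv_cache_bytes(32, 32, 128, context_len=4096, batch_size=1)
batch_32 = kv_cache_bytes(32, 32, 128, context_len=4096, batch_size=32)

print(one_seq / GIB)   # 2.0 GiB for a single 4k-token sequence
print(batch_32 / GIB)  # 64.0 GiB for 32 such sequences
```

A 4090 has 24 GB of memory, so under these assumptions 32 full 4k-token sequences would not fit alongside the weights, which is why the context length behind the batch-32 figure matters.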

And more generally, I'm curious about the use case for running puny models like Llama-2 7B in the cloud on desktop GPUs (like a 4090) with batch==32?