|
Hi, one of the founders here. I'll try to address several of the comments in a single message. On why we decided not to compare against existing methods: I think it would be difficult to do so fairly, since there are many tradeoffs and different use cases. It's not always the case that one technique is bad and the other is good; it's more about the targeted design point (say, cloud vs. local). We are openly offering our numbers and benchmarks, and we're looking for early partners who are aligned with our current value proposition (hence the closed beta). A good example: llama.cpp is a fantastic framework for running models locally in the single-user case (batch=1). While llama.cpp supports different backends (RPi, CPU, GPU), I don't think it would be particularly fair to show that MKML is better at a given perplexity, compression ratio, and speed on GPU in a multi-user case (batch >> 1) when that is not llama.cpp’s targeted use case (afaik). For example, MKML achieves ~2700 tok/sec at batch 32 (i.e. 32 prompts in parallel) on a 4090 for Llama-2 7B, with a ~4̶.̶2̶G̶B 5.2GB memory footprint and perplexity that is ~fp16. Also, we're not currently wrapping any open source tools or techniques for quantization. Everything is our own, and there’s more news to come soon. If anyone has specific technical questions, I'd be happy to answer as best I can. Cheers,
Paul Merolla |
Maybe I'm misunderstanding MKML. As I understood it, MKML is a compression step that then feeds into another framework like HF's Transformers or PyTorch. If that's the case, then comparing MKML to llama.cpp is apples to oranges; the correct comparison would be to GGML and the various quantization methods. The inference engine and its intended use cases aren't what's in question here.
If a model compressed with MKML outperforms a standard quantized model in a batch setting, that's useful information! It would not be at all unfair for you to cite that as a strength, and it would increase your credibility because you wouldn't seem to be dodging the question of how you compare to your substitutes.