Hacker News
by tails4e 1042 days ago
When you say best performance on Nvidia, do you mean against any other method of running this model on an Nvidia card?
2 comments

I can confirm this: MLC is shockingly fast on my RTX 2060.

The catch is:

- MLC's quantization is somewhat different (though I haven't run any perplexity tests yet)

- There is no CPU offloading (or splitting onto an IGP) like llama.cpp has yet (unless it's new and I missed it).

True, and there are some other issues to be addressed. Those two particular issues are on our roadmap.

Regarding quantization, we want to develop a code path that absorbs any quantization format, for example those from GGML or GPTQ, so that they can all be used. ML compilation (MLC) is agnostic to the quantization format, but we just haven't exposed such abstractions yet.
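
To make the "format" point concrete, here is a rough sketch of what a 4-bit scheme boils down to: small integer codes plus per-group metadata (here a scale and a minimum). This is a generic min/max group quantizer written for illustration only; the function names and the group size are assumptions, and it is not the actual layout used by MLC, GGML, or GPTQ.

```python
from typing import List, Tuple

# One quantized group: integer codes plus the metadata needed to decode them.
Group = Tuple[List[int], float, float]  # (codes, scale, minimum)


def quantize_group(weights: List[float], bits: int = 4) -> Group:
    """Asymmetric min/max quantization of a single group of weights."""
    levels = (1 << bits) - 1  # 15 distinct steps for 4-bit codes
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Round to the nearest code and clamp into the representable range.
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def quantize(weights: List[float], group_size: int = 4, bits: int = 4) -> List[Group]:
    """Split the weights into fixed-size groups and quantize each one."""
    return [
        quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]


def dequantize(groups: List[Group]) -> List[float]:
    """Reconstruct approximate weights from codes and per-group metadata."""
    restored: List[float] = []
    for codes, scale, lo in groups:
        restored.extend(c * scale + lo for c in codes)
    return restored


weights = [0.12, -0.50, 0.33, 0.90, -1.20, 0.05, 0.44, -0.07]
groups = quantize(weights, group_size=4)
restored = dequantize(groups)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Real formats differ mainly in the group size, whether the minimum is stored, and how the codes are packed into bytes, which is why a format-agnostic code path is plausible in principle.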

On CPU offloading: if you were writing PyTorch, it would be as simple as a one-liner, `some_tensor.cpu()`, to bring something down to host memory, and `some_tensor.cuda()` to get it back to CUDA. It seems like low-hanging fruit, but it's not implemented yet in MLC LLM :( Lots of stuff to do, and we should make this happen soon.
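
For reference, the PyTorch round trip mentioned above looks roughly like this. It is plain PyTorch and says nothing about how MLC LLM would implement offloading; the tensor name and shape are made up, and the sketch falls back to the CPU when no CUDA device is present.

```python
import torch

# Use the GPU when one is available; otherwise everything stays on the CPU.
has_cuda = torch.cuda.is_available()
device = torch.device("cuda" if has_cuda else "cpu")

# Hypothetical stand-in for a block of model weights.
some_tensor = torch.randn(4, 4, device=device)

# Offload: copy the tensor down to host memory.
on_host = some_tensor.cpu()

# Bring it back to the GPU when it is needed again (skipped without CUDA).
back_on_device = on_host.cuda() if has_cuda else on_host
```

Both calls return the same tensor unchanged if it already lives on the target device, so the round trip is safe to run on a CPU-only machine.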

Yeah, we tried out popular solutions like exllama and llama.cpp, among others, that support inference of 4-bit quantized models.