by brucethemoose2
1040 days ago
I can confirm this: MLC is shockingly fast on my RTX 2060. The catches are:
- MLC's quantization is somewhat different (though I haven't run any perplexity tests yet).
- There is no CPU offloading (or splitting onto an IGP) like llama.cpp yet (unless it's new and I missed it).
Regarding quantization, we wanted to develop a code path that can absorb any quantization format, for example those from GGML or GPTQ, so that they can all be used. ML compilation (MLC) is agnostic to the quantization format; we just haven't exposed such abstractions yet.
On CPU offloading: if you were writing PyTorch, it would be as simple as a one-liner, `some_tensor.cpu()` to bring a tensor down to host memory and `some_tensor.cuda()` to move it back to the GPU. It seems like low-hanging fruit, but it's not implemented yet in MLC LLM :( There's lots to do, and we should make this happen soon.
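For anyone unfamiliar with the PyTorch pattern mentioned here, a rough sketch follows. This is plain PyTorch, not MLC LLM code; the helper name `offload_roundtrip` is made up for illustration, and the script falls back to the CPU when no CUDA device is available.

```python
import torch


def offload_roundtrip(t: torch.Tensor) -> torch.Tensor:
    """Copy a tensor to host memory, then move it back to its original device.

    Hypothetical helper, for illustration only.
    """
    original_device = t.device
    on_host = t.cpu()  # returns the same tensor if it already lives on the CPU
    return on_host.to(original_device)


# Use the GPU when one is present; otherwise everything stays on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

weights = torch.arange(6, dtype=torch.float32, device=device).reshape(2, 3)
restored = offload_roundtrip(weights)
```

On a CUDA machine, `weights.cpu()` and `weights.cuda()` are the two one-liners; the `.to(device)` form used in the helper does the same thing while also working when only a CPU is present.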