| HN Mirror



	by tails4e 1042 days ago
	When you say best performance on nvidia, do you mean against any other method of running this model on an nvidia card?

2 comments

brucethemoose2 1042 days ago

I can confirm this, MLC is shockingly fast on my RTX 2060.

The catch is:

- MLC's quantization is somewhat different (though I haven't run any perplexity tests yet)

- There is no CPU offloading (or splitting onto an IGP) like llama.cpp yet (unless it's new and I missed it).


junrushao1994 1042 days ago

True, and there are some other issues to be addressed. Those two particular issues are on our roadmap.

Regarding quantization, we wanted to develop a code path that absorbs any quantization formats, for example, those from GGML or GPTQ, so that they could be all used. ML compilation (MLC) is agnostic to any quantization formats, but we just haven't exposed such abstractions yet.

On CPU offloading, imagine you are writing PyTorch: it should be as simple as a one-liner `some_tensor.cpu()` to bring something down to host memory, and `some_tensor.cuda()` to get it back to CUDA - seems like low-hanging fruit but it's not implemented yet in MLC LLM :( Lots of stuff to do and we should make this happen soon.
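
For reference, the PyTorch round trip mentioned above looks roughly like the sketch below. This is plain PyTorch, not MLC LLM code; the helper name `offload_roundtrip` and the sample tensor are made up for illustration, and the sketch falls back to staying on the host when no CUDA device is present.

```python
import torch


def offload_roundtrip(t: torch.Tensor) -> torch.Tensor:
    """Move a tensor to host memory, then back to the GPU if one exists.

    Hypothetical helper for illustration only; not part of MLC LLM.
    """
    on_host = t.cpu()  # copy to host (CPU) memory; returns t itself if already there
    if torch.cuda.is_available():
        return on_host.cuda()  # copy back to the current CUDA device
    return on_host  # no GPU available: the tensor stays in host memory


# Stand-in for a block of model weights.
weights = torch.arange(6, dtype=torch.float32).reshape(2, 3)
restored = offload_roundtrip(weights)
print(restored.device.type)
```

The values survive the round trip unchanged; only the device the tensor lives on differs.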


junrushao1994 1042 days ago

Yeah, we tried out popular solutions like exllama and llama.cpp, among others, that support inference of 4-bit quantized models.
