
by lhl 1068 days ago

Until recently, exllama was significantly faster, but they're about on par now (with llama.cpp even pulling ahead on certain hardware or with certain compile-time optimizations).

There are a couple of big differences as I see it. llama.cpp uses the `ggml` format for its models. There were a few weeks where they kept making breaking revisions, which was annoying, but it seems to have stabilized and now also supports more flexible quantization w/ k-quants. exllama was built exclusively for 4-bit GPTQ quants (compatible w/ GPTQ-for-LLaMA, AutoGPTQ). exllama still has an advantage w/ the best multi-GPU scaling out there, but as you say, the projects are evolving quickly, so it's hard to say how long that will hold. It has a smaller focus/community than llama.cpp, which has its pros and cons.
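
For anyone unsure what "4-bit" with a group size means in practice, here's a minimal sketch of group-wise round-to-nearest quantization. To be clear, this is neither GPTQ (which also compensates for rounding error layer by layer) nor llama.cpp's k-quants; the function names and the default group size of 128 are assumptions for illustration only.

```python
def quantize_group(weights, bits=4):
    """Round-to-nearest quantization of one group to `bits`-bit integer codes."""
    levels = 2 ** bits - 1                # 15 steps for 4-bit codes 0..15
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0     # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float weights."""
    return [c * scale + lo for c in codes]


def quantize(weights, group_size=128, bits=4):
    """Quantize a flat weight list group by group (hypothetical helper)."""
    return [
        quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]


# Toy data: 256 evenly spaced "weights" in [-1, 1), i.e. two groups of 128.
toy_weights = [i / 128 - 1.0 for i in range(256)]
groups = quantize(toy_weights)
restored = [w for g in groups for w in dequantize_group(*g)]
max_error = max(abs(a - b) for a, b in zip(toy_weights, restored))
```

Each group stores its own scale and offset, so smaller groups track the weights more closely at the cost of more metadata per weight.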

It's good to have multiple viable options though, especially if you're trying to find something that works best w/ your environment/hardware, and I'd recommend anyone give HEAD checkouts of both a try and see which one works best for them.

1 comments

juliensalinas 1067 days ago

Thank you for the update! Do you happen to know if there are quality comparisons somewhere, between llama.cpp and exllama? Also, in terms of VRAM consumption, are they equivalent?

lhl 1062 days ago

ExLlama still uses a bit less VRAM than anything else out there: https://github.com/turboderp/exllama#new-implementation - this is sometimes significant: in my personal experience it can run a quantized llama-33b model at full context on a 24GB GPU, where other inference engines can OOM.
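
To see why 33b on 24GB is so tight, here's a rough back-of-envelope sketch. The architecture numbers (32.5B params, 60 layers, hidden size 6656) are from the LLaMA paper; the ~4.25 effective bits per weight, the 2048-token context and the fp16 KV cache are my assumptions, and the estimate ignores activations, scratch buffers and framework overhead, which is exactly where engines differ.

```python
GIB = 2 ** 30


def estimate_vram_gib(n_params, bits_per_weight, n_layers, hidden_dim,
                      context_len, kv_bytes=2):
    """Rough lower bound: quantized weights plus a full-context KV cache."""
    weight_bytes = n_params * bits_per_weight / 8
    # Keys and values (factor 2), one hidden-size vector per layer per token.
    kv_cache_bytes = 2 * n_layers * hidden_dim * context_len * kv_bytes
    return weight_bytes / GIB, kv_cache_bytes / GIB


# llama-33b: 32.5B params, 60 layers, hidden size 6656 (LLaMA paper).
# 4.25 bits/weight and 2048 context are assumed, not measured.
weights_gib, kv_gib = estimate_vram_gib(
    n_params=32.5e9,
    bits_per_weight=4.25,
    n_layers=60,
    hidden_dim=6656,
    context_len=2048,
)
total_gib = weights_gib + kv_gib
print(f"weights ~{weights_gib:.2f} GiB, KV cache ~{kv_gib:.2f} GiB, "
      f"total ~{total_gib:.2f} GiB")
```

With these assumptions the weights come to roughly 16 GiB and the KV cache to roughly 3 GiB, so a few GB of extra overhead is enough to decide whether a 24GB card fits or OOMs.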

oobabooga recently did a direct perplexity comparison against various engines/quants: https://oobabooga.github.io/blog/posts/perplexities/

On wikitext, for llama-13b, the perplexity of a q4_K_M GGML on llama.cpp was within 0.3% of the perplexity of a 4-bit 128g desc_act GPTQ on ExLlama, so basically interchangeable.
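
For reference, perplexity is just the exponential of the average negative log-likelihood per token, so a comparison like this reduces to a few lines. A minimal sketch, where the function names are mine and the numbers in the usage example are made up for illustration (they are not oobabooga's results):

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_gap(ppl_a, ppl_b):
    """Relative difference between two perplexities, against the lower one."""
    return abs(ppl_a - ppl_b) / min(ppl_a, ppl_b)


# Sanity check: a model that gives every token probability 1/4 has perplexity 4.
uniform_ppl = perplexity([math.log(0.25)] * 8)

# Illustrative values only: two engines scoring the same text.
gap = relative_gap(5.000, 5.015)
print(f"uniform ppl = {uniform_ppl:.2f}, gap = {gap:.2%}")
```

Both engines have to be scored on the same text with the same context length and stride, otherwise the numbers aren't comparable.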

There are some new quantization formats being proposed like AWQ, SpQR, SqueezeLLM that perform slightly better, but none have been implemented in any real systems yet (the paper for SqueezeLLM is the latest, and has comparison vs AWQ and SpQR if you want to read about it: https://arxiv.org/pdf/2306.07629.pdf)

abhinavkulkarni 1066 days ago

Here's one: https://huggingface.co/spaces/mike-ravkine/can-ai-code-resul...

juliensalinas 1063 days ago

Thank you.