Hacker News
by kir-gadjello 744 days ago
While llama3-8b might be slightly more brittle under quantization, llama3-70b really surprised myself and others[1] in how well it performs even in the 2..3 bits per parameter regime. It requires one of the most advanced quantization methods (IQ2_XS specifically) to work, but the reward is a SoTA LLM that fits on one 4090 GPU with 8K context (KV cache uncompressed, btw) and allows for advanced use cases such as powering the agent engine I'm working on: https://github.com/kir-gadjello/picoagent-rnd

For me it completely replaced strong models such as Mixtral-8x7B and DeepSeek-Coder-Instruct-33B.

1. https://www.reddit.com/r/LocalLLaMA/comments/1cst400/result_...
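
For anyone who wants to sanity-check the "fits on one 4090" claim, here is a rough back-of-the-envelope sketch. The numbers are assumptions, not taken from the post: about 70.6B parameters, about 2.31 bits per weight as the IQ2_XS average, and the commonly cited Llama 3 70B shape (80 layers, 8 KV heads, head dimension 128) with an FP16 KV cache. Real GGUF files keep some tensors at higher precision, so actual sizes will differ somewhat.

```python
# Rough VRAM estimate: quantized weights plus an uncompressed (FP16) KV cache.
# All model numbers below are assumptions used for illustration.

def weight_bytes(n_params, bits_per_weight):
    """Bytes needed to store n_params weights at an average bit width."""
    return n_params * bits_per_weight / 8

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    """Bytes for the KV cache; the leading 2 covers keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

weights = weight_bytes(70.6e9, 2.31)      # assumed IQ2_XS average bit width
kv = kv_cache_bytes(80, 8, 128, 8192)     # assumed Llama 3 70B shape, 8K context
total = weights + kv

print(f"weights:  {weights / 1e9:.2f} GB")
print(f"KV cache: {kv / 1e9:.2f} GB")
print(f"total:    {total / 1e9:.2f} GB (a 4090 has 24 GB)")
```

Under these assumptions the total lands a bit above 23 GB, which is consistent with the post: it fits, but with very little headroom for activations and anything else on the GPU.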

3 comments

How does it compare against unquantized Llama 3 8B at FP16? I’ve been using that locally and it’s almost replaced GPT4 for me. Runs in about 14GB of VRAM.
llama3 is nowhere near gpt4, though it is cool
What is your use case where you find it comparable to gpt4?
For creative tasks, for example, Llama 3 is much better. GPT-4 is very sterile, Llama is much more whimsical, and has a lot more character.
> For creative tasks

What, specifically, are you asking of these LLMs? "creative tasks" can be anything from programming to cooking recipes, so a tiny bit more specificity would be appreciated :)

Sorry, I meant making up stuff, e.g. creative writing.
I've used pretty much every major LLM out there for a specific type of creative writing, and none of them are as good at it as GPT4 with the exception of maybe Claude (Opus is actually probably even better regarding the sterility). Llama 3, even 70b, is definitely not better by any measure of actual quality - it's more random, at best.
So like making presentations? Composing poems? Writing novellas/novels? Technical articles?
> surprised myself and others[1] in how well it performs even in the 2..3 bits per parameter regime

I am too dumb for all of this ML stuff. Can you explain what exactly that means & why it's surprising?

Artificial neural networks work the following way: you have a bunch of “neurons”, each with inputs and an output. A neuron’s inputs have weights associated with them: the larger the weight, the more influence the input has on the neuron. These weights need to be represented in our computers somehow, and usually people use IEEE 754 floating-point numbers. But these numbers take a lot of space (32 or 16 bits each). So one approach people have invented is to use a more compact representation of these weights (10, 8, down to 2 bits). This process is called quantisation. Having a smaller representation makes running the model faster because models are currently limited by memory bandwidth (how long it takes to read the weights from memory), so going from 32 bits to 2 bits potentially leads to a 16x speedup. The surprising part is that the models still produce decent results, even when a lot of information from the weights was “thrown away”.
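
To make that concrete, here is a minimal sketch of the simplest possible scheme: symmetric round-to-nearest quantisation with a single scale for the whole tensor. This is not IQ2_XS (which, as far as I understand, uses per-block scales and codebooks); the function names and the random "weights" are made up for illustration.

```python
import random

def quantize(weights, bits):
    """Symmetric round-to-nearest quantisation to signed `bits`-bit codes."""
    qmax = 2 ** (bits - 1) - 1                  # 127 for 8 bits, 1 for 2 bits
    scale = max(abs(w) for w in weights) / qmax
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]  # stand-in weights

errors = {}
for bits in (8, 4, 2):
    codes, scale = quantize(weights, bits)
    errors[bits] = mean_abs_error(weights, dequantize(codes, scale))
    # If reading weights from memory is the bottleneck, fewer bits per
    # weight means proportionally less data to move: 32 / 2 = 16x.
    print(f"{bits} bits: mean abs error {errors[bits]:.5f}, "
          f"up to {32 // bits}x less data than FP32")
```

The error grows as the bit width shrinks, and at 2 bits this naive scheme only has three levels (-1, 0, 1) per weight, which is why more elaborate methods are needed to keep a model usable in that regime.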
Oh, nice. Thank you for this explanation. Now I think I get quantisation. Very well explained for someone like me. Thank you a lot!
Holy wow. Thank you for this. Very cool. I’ve been using 8b for things it might be worth using 70b for.