

by kir-gadjello 792 days ago

While llama3-8b might be slightly more brittle under quantization, llama3-70b really surprised myself and others[1] in how well it performs even in the 2..3 bits per parameter regime. It requires one of the most advanced quantization methods (IQ2_XS specifically) to work, but the reward is a SoTA LLM that fits on one 4090 GPU with 8K context (KV-cache uncompressed btw) and allows for advanced use cases such as powering the agent engine I'm working on: https://github.com/kir-gadjello/picoagent-rnd
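A rough back-of-the-envelope check of the "fits on one 4090 with 8K context" claim, as a minimal sketch. The numbers are assumptions, not taken from the comment: roughly 70.6B parameters, about 2.31 bits per weight for IQ2_XS, and the published Llama 3 70B shape (80 layers, 8 KV heads, head dimension 128) with an FP16 KV cache. It ignores activations and runtime overhead, so treat the result as a lower bound.

```python
def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Size of the quantized weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_value: int = 2) -> float:
    """Size of an uncompressed KV cache in gigabytes.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * bytes_per_value) / 1e9


# Assumed values (not stated in the thread).
N_PARAMS = 70.6e9        # approximate parameter count of Llama 3 70B
IQ2_XS_BPW = 2.31        # approximate bits per weight for IQ2_XS
GPU_VRAM_GB = 24.0       # nominal VRAM of an RTX 4090

weights = weights_gb(N_PARAMS, IQ2_XS_BPW)
kv = kv_cache_gb(n_layers=80, n_kv_heads=8, head_dim=128, context_len=8192)
total = weights + kv

print(f"weights: {weights:.1f} GB, KV cache: {kv:.1f} GB, total: {total:.1f} GB")
print("fits in VRAM (before overhead):", total < GPU_VRAM_GB)
```

Under these assumptions the weights come to roughly 20 GB and the 8K KV cache to roughly 2.7 GB, which leaves only a small margin on a 24 GB card.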

For me it completely replaced strong models such as Mixtral-8x7B and DeepSeek-Coder-Instruct-33B.

1. https://www.reddit.com/r/LocalLLaMA/comments/1cst400/result_...

3 comments

d13 792 days ago

How does it compare against unquantized Llama 3 8B at FP16? I’ve been using that locally and it’s almost replaced GPT4 for me. Runs in about 14GB of VRAM.


iwontberude 792 days ago

llama3 is nowhere near gpt4, though it is cool


LordDragonfang 792 days ago

What is your use case where you find it comparable to gpt4?


stavros 792 days ago

For creative tasks, for example, Llama 3 is much better. GPT-4 is very sterile, Llama is much more whimsical, and has a lot more character.


CaptainOfCoit 792 days ago

> For creative tasks

What, specifically, are you asking of these LLMs? "Creative tasks" can be anything from programming to cooking recipes, so a tiny bit more specificity would be appreciated :)


stavros 792 days ago

Sorry, I meant making up stuff, e.g. creative writing.


LordDragonfang 790 days ago

I've used pretty much every major LLM out there for a specific type of creative writing, and none of them are as good at it as GPT4 with the exception of maybe Claude (Opus is actually probably even better regarding the sterility). Llama 3, even 70b, is definitely not better by any measure of actual quality - it's more random, at best.


CaptainOfCoit 791 days ago

So like making presentations? Composing poems? Writing novellas/novels? Technical articles?


endofreach 792 days ago

> surprised myself and others[1] in how well it performs even in the 2..3 bits per parameter regime

I am too dumb for all of this ML stuff. Can you explain what exactly that means & why it's surprising?


m1el 792 days ago

Artificial neural networks work the following way: you have a bunch of “neurons”, each with inputs and an output. A neuron’s inputs have weights associated with them: the larger the weight, the more influence the input has on the neuron. These weights need to be represented in our computers somehow, and people usually use IEEE 754 floating-point numbers. But these numbers take a lot of space (32 or 16 bits each). So one approach people have invented is to use a more compact representation of these weights (10, 8, down to 2 bits). This process is called quantisation. A smaller representation makes the model run faster because models are currently limited by memory bandwidth (how long it takes to read the weights from memory), so going from 32 bits to 2 bits potentially leads to a 16x speed-up. The surprising part is that the models still produce decent results, even when a lot of information from the weights was “thrown away”.
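To make the idea concrete, here is a minimal sketch of the simplest kind of quantisation: uniform round-to-nearest with one shared scale per block of weights. This is an illustration only and is not how IQ2_XS works (that method is considerably more elaborate); the function names, the block size of 32, and the random test weights are all made up for the example.

```python
import random


def quantize_block(weights, bits):
    """Map floats to small signed integers that share one scale factor."""
    levels = 2 ** (bits - 1) - 1          # 127 for 8 bits, 7 for 4, 1 for 2
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / levels if max_abs > 0 else 1.0
    codes = [max(-levels, min(levels, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


def mean_abs_error(weights, bits):
    """Average absolute difference between original and restored weights."""
    codes, scale = quantize_block(weights, bits)
    restored = dequantize_block(codes, scale)
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)


def bandwidth_speedup(from_bits, to_bits):
    """Upper bound on speed-up if reading weights is the only bottleneck."""
    return from_bits / to_bits


random.seed(0)
block = [random.gauss(0.0, 0.02) for _ in range(32)]  # made-up weights

errors = {bits: mean_abs_error(block, bits) for bits in (8, 4, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit: mean abs error = {err:.6f}")

print("32 -> 2 bits, best-case speed-up:", bandwidth_speedup(32, 2))
```

The error grows as the bit width shrinks, which is why it is notable that a 70B model stays usable at 2 to 3 bits; real low-bit schemes work hard to keep that error from hurting the outputs.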


endofreach 791 days ago

Oh, nice. Thank you for this explanation. Now I think I get quantisation. Very well explained for someone like me. Thank you a lot!


renewiltord 792 days ago

Holy wow. Thank you for this. Very cool. I’ve been using 8b for things it might be worth using 70b for.
