| HN Mirror



	by eldenring 1031 days ago
	GPT-3.5 is much, much smarter than Llama2. It's not nearly as close as the benchmarks make it seem.

2 comments

Tostino 1031 days ago

So, I'm speaking as somebody who has fine-tuned Llama2 (13b) on a new prompt template / chat format, as well as on instruction following, summarization, knowledge graph creation, traversing a knowledge graph for information, describing relationships in the knowledge graph, etc.

The fine-tuned model is able to use the knowledge graph to write coherent text that is well structured, lengthy, and follows the connections outlined in the graph to their logical conclusions, while deriving non-explicit insights from the graph in its writing.

Just to say, I've seen a giant improvement in performance from Llama2 by fine-tuning. And like I said, that's just 13b... I am perfecting the dataset with 13b before moving to 70b.

3.5-turbo is sometimes okay. I've tested it moderately on the same tasks I've been training/testing Llama2 on, and it's just a bit behind. Honestly, my fine-tune is more consistent than GPT-4 for a good number of the tasks I've trained.


fullstackchris 1030 days ago

but how is the speed here? does it feel fast "enough"?

looking into running llama on prem / private cloud but i have no idea where to start in terms of sizing. do you have any details or posts on what the minimum / recommended hardware requirements are?

EDIT: just looked myself, not as encouraging as I'd like: "For good results, you should have at least 10GB VRAM at a minimum for the 7B model, though you can sometimes see success with 8GB VRAM. The 13B model can run on GPUs like the RTX 3090 and RTX 4090"

definitely borderline dealbreaking for solo hackers / small teams


Tostino 1030 days ago

1x 3090 IMO is about the minimum you'd want to waste time with. It can serve a 13b and a 7b model at once if you want, you can QLoRA-train a 13b with a long context length, and it's fast enough to iterate with for training.

I have 2x 3090 in my machine, and I can do inference at ~40 tokens/sec on a 13b Llama2 model on one card. I can split the 70b parameter model between the two cards and get ~12-15 tokens/sec. Sadly, I can't train the 70b parameter model with my 2x 3090, though: not quite enough VRAM.
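
As a rough sanity check on the sizing numbers in this subthread, here is a minimal sketch that estimates how much VRAM the model weights alone need at a given precision. It rests on assumptions that are not in the comments: the helper names `weight_gb` and `fits` are hypothetical, 4-bit quantization is only a guess at the setup being described, and the KV cache, activations, and framework overhead are ignored, so real usage will be higher. The one hardware fact used is that an RTX 3090 has 24 GB of VRAM.

```python
# Back-of-the-envelope VRAM estimate for model weights only.
# Assumption: KV cache, activations, and framework overhead are NOT counted,
# so treat every result as a lower bound.

CARD_GB = 24.0  # VRAM on a single RTX 3090


def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits(total_gb: float, cards: int = 1, card_gb: float = CARD_GB) -> bool:
    """True if the weights alone fit in the combined VRAM of `cards` GPUs."""
    return total_gb <= cards * card_gb


if __name__ == "__main__":
    for size in (7, 13, 70):
        for bits in (16, 8, 4):
            gb = weight_gb(size, bits)
            print(f"{size}b @ {bits}-bit: {gb:.1f} GB, "
                  f"1 card: {fits(gb, 1)}, 2 cards: {fits(gb, 2)}")
```

Under these assumptions, a 13b and a 7b model quantized to 4 bits fit together on one card with room to spare, while a 70b model needs 4-bit weights and both cards before inference is even possible. Training needs memory on top of the weights, which this estimate does not model.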


jron 1031 days ago

Did you opt for LoRA or did you tune all of the layers?


Tostino 1031 days ago

I opted for LoRA (QLoRA), but I targeted all layers with it.
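
To give a feel for what targeting all layers costs compared with the more common attention-only setup, here is a minimal sketch that counts the trainable parameters LoRA adds to Llama-2-13B. Several things here are assumptions rather than details from the comment: "all layers" is read as every linear projection in each decoder block, the rank of 16 is a hypothetical value, and `lora_params` is an illustrative helper. The projection names follow the Hugging Face Llama implementation, and the shapes are those of the published 13b config (hidden size 5120, MLP size 13824, 40 blocks).

```python
# Count trainable LoRA parameters for Llama-2-13B.
# Assumption: adapters go only on the linear projections inside the decoder
# blocks; embeddings and the output head are left untouched.

HIDDEN = 5120         # hidden size of Llama-2-13B
INTERMEDIATE = 13824  # MLP intermediate size
LAYERS = 40           # number of decoder blocks

# (in_features, out_features) for each linear projection in one block.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

ATTENTION_ONLY = ["q_proj", "v_proj"]
ALL_LINEAR = list(PROJECTIONS)


def lora_params(targets, rank):
    """LoRA adds two matrices per adapted layer: rank*in + out*rank params."""
    per_block = sum(rank * sum(PROJECTIONS[name]) for name in targets)
    return per_block * LAYERS


if __name__ == "__main__":
    rank = 16  # hypothetical rank, not stated in the thread
    for label, targets in (("q/v only", ATTENTION_ONLY), ("all linear", ALL_LINEAR)):
        print(f"{label}: {lora_params(targets, rank):,} trainable parameters")
```

Even with every projection adapted, the adapter weights are a small fraction of the 13 billion base parameters, which is part of why this approach can be trained on a single 24 GB card when the frozen base model is quantized.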


jron 1031 days ago

Thanks for the reply. I'm far more interested in open-ish or fully open models so your post is really encouraging.


intellectronica 1031 days ago

Indeed, and this is really missing from the public discourse. People are talking about Llama 70b as if it was a drop-in replacement for gpt-3.5, but you only have to play with both for half an hour to figure out that's not generally the case and only looks true in cherry-picked examples.
