Hacker News
by pants2 784 days ago
I'd be interested to hear how Llama 8B with long chain-of-thought prompts compares to GPT-4 one-shot prompts for real-world tasks.

In classification for example, you could ask Llama 8B to reason through each possibility, rank them, rate them, make counterarguments, etc. - all in the same time that GPT-4 would take to output one classification without reasoning. Which does better?

2 comments

I tried that with Llama 3 8B on a few things I could think of, and it did very well. It was on par with GPT-4. I prompted some scenarios and asked it to use CoT. Scenarios like "I was standing and eating chocolate, and it melted. Will I find chocolate at my feet?", and the reasoning was pretty good.

But there was something it did way better than GPT-4. I asked it to create 10 phrases where the last word was an animal, excluding equines, and in alphabetical order. GPT-3.5 and GPT-4 aren't able to follow such instructions, but the 8B model did it with mastery.
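
For anyone who wants to score that instruction-following test automatically rather than by eye, here is a rough sketch of a checker. It is not what I ran. The word lists are tiny and hypothetical, and I'm assuming "alphabetical order" means the final animal words are sorted; a real check would need a much fuller animal vocabulary.

```python
import re

# Hypothetical, deliberately small word lists for illustration only.
ANIMALS = {"ant", "bear", "cat", "dog", "eagle", "fox", "goat", "hen", "ibis", "jaguar"}
EQUINES = {"horse", "pony", "donkey", "mule", "zebra"}


def last_word(phrase):
    """Return the final alphabetic word of a phrase, lowercased ('' if none)."""
    words = re.findall(r"[a-z]+", phrase.lower())
    return words[-1] if words else ""


def check_phrases(phrases, expected_count=10):
    """Return a list of violated constraints; an empty list means all passed."""
    problems = []
    if len(phrases) != expected_count:
        problems.append(f"expected {expected_count} phrases, got {len(phrases)}")

    endings = [last_word(p) for p in phrases]
    for phrase, word in zip(phrases, endings):
        if word in EQUINES:
            problems.append(f"equine ending: {phrase!r}")
        elif word not in ANIMALS:
            problems.append(f"last word is not a known animal: {phrase!r}")

    # Assumption: the ordering constraint applies to the final animal words.
    if endings != sorted(endings):
        problems.append("endings are not in alphabetical order")
    return problems
```

Feed it the model's ten lines as a list of strings; anything it returns is a constraint the model missed.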

Good idea, that could make for a pretty interesting eval. It's similar to a timed test... we don't really care how long it takes or how much scratch paper you needed as long as you deliver the correct answer within the time limit.
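
A minimal sketch of how the timed-test scoring could look, assuming each model is wrapped in a plain function that takes a prompt and returns the final answer (any reasoning already stripped out). The names `timed_call` and `score_timed` are made up for illustration, not from any existing eval library.

```python
import time
from typing import Callable, List, Tuple


def timed_call(model: Callable[[str], str], prompt: str) -> Tuple[str, float]:
    """Run a (hypothetical) model function and return (answer, elapsed seconds)."""
    start = time.perf_counter()
    answer = model(prompt)
    return answer, time.perf_counter() - start


def score_timed(records: List[Tuple[str, str, float]], time_limit_s: float) -> float:
    """Accuracy over (predicted, expected, seconds) records.

    An answer only counts if it is correct AND arrived within the time limit;
    how much scratch work happened before that is ignored.
    """
    if not records:
        return 0.0
    hits = sum(
        1
        for predicted, expected, seconds in records
        if seconds <= time_limit_s
        and predicted.strip().lower() == expected.strip().lower()
    )
    return hits / len(records)
```

With this shape, the small model can spend its whole budget on chain-of-thought and the big model can answer in one shot, and both get graded on the same rule: right answer, on time.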