by harlanlewis 934 days ago
> GPT4 is getting beat by the small prompt library LATS wrapped around GPT3.5

The link you shared doesn’t quite reflect this. Omitting other models…

LATS (gpt-4): 94.4
Reflexion (gpt-4): 91.0
gpt-4: 86.6
…
LATS (gpt-3.5): 83.8
…
zero-shot (gpt-4): 67.0
zero-shot (gpt-3.5): 48.1

I’m not quite sure how to translate leaderboards like these into actual utility, but it certainly feels like “good enough” is only going to get more accessible and I agree with what I think is your broader point - more sophisticated techniques will make small, affordable, self-hostable models viable in their own right.

I’m optimistic we’re on a path where further improvement isn’t totally dependent on just throwing money at more parameters.

1 comments

Ah you're right, LATS GPT3.5 is 84 while standalone GPT4 is 87

Given standalone GPT3.5 is "just" 48... it's less about beating and more about meeting.
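To put rough numbers on "meeting vs beating", here is a quick sketch that only does the subtraction on the scores quoted upthread. The dict keys are copied from that excerpt; the `gap` helper and the variable names are made up for illustration, and nothing here says what the benchmark metric actually is.

```python
# Scores as quoted in the parent comment (leaderboard excerpt, other models omitted).
scores = {
    "LATS (gpt-4)": 94.4,
    "Reflexion (gpt-4)": 91.0,
    "gpt-4": 86.6,
    "LATS (gpt-3.5)": 83.8,
    "zero-shot (gpt-4)": 67.0,
    "zero-shot (gpt-3.5)": 48.1,
}


def gap(a: str, b: str) -> float:
    """Hypothetical helper: score difference a - b, rounded to one decimal."""
    return round(scores[a] - scores[b], 1)


# How much LATS adds on top of zero-shot GPT3.5.
lats_lift_on_gpt35 = gap("LATS (gpt-3.5)", "zero-shot (gpt-3.5)")

# How far LATS + GPT3.5 still sits behind the plain gpt-4 entry.
shortfall_vs_gpt4 = gap("gpt-4", "LATS (gpt-3.5)")

print(f"LATS lift on GPT3.5: {lats_lift_on_gpt35} points")
print(f"Remaining gap to gpt-4: {shortfall_vs_gpt4} points")
```

So on these figures the wrapper recovers most of the distance but does not close it, which is the "meeting" framing rather than "beating".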

RE:Good Enough & Feel... very much agreed. I find it very task dependent!

For example, GPT4 is 'good enough' that developers are comfortable copy-pasting & trying its output, even vs Stack Overflow results. We haven't seen LATS+MagicCoder yet, but as MagicCoder 7b already meets or exceeds GPT3.5 on HumanEval, there's plausible hope for agent-aided, GPT4-grade tools being always-on for all coding tasks, and sooner rather than later. We made that bet for Louie.AI's interactive analyst interface, and as each month passes, evidence mounts. We can go surprisingly far with GPT3.5 before wanting to switch to GPT4 for this kind of interaction scenario.

Conversely... I've yet to see a true long-running autonomous coding autoGPT where the error rate doesn't kill it. We're experimenting with design partners on directions here -- think autonomous investigations etc -- but that's more on the advanced fringe, with special use cases, guard rails, etc. For most of our users and use cases... we're able to more reliably deliver -- today -- on the interactive scenarios with smaller snippets.