
Ah you're right, LATS GPT3.5 is 84 while standalone GPT4 is 87

Given standalone GPT3.5 is "just" 48, it's less about beating GPT4 and more about meeting it.

Re: Good Enough & Feel... very much agreed. I find it very task dependent!

For example, GPT4 is 'good enough' that developers are comfortable copy-pasting its output and trying it, even compared to Stack Overflow results. We haven't seen LATS+MagicCoder yet, but as MagicCoder 7b already meets or exceeds GPT3.5 on HumanEval, there's a plausible hope for agent-aided GPT4-grade tools being always-on for all coding tasks, and sooner rather than later. We made that bet for Louie.AI's interactive analyst interface, and as each month passes, evidence mounts. We can go surprisingly far with GPT3.5 before wanting to switch to GPT4 for this kind of interaction scenario.

Conversely... I've yet to see a true long-running autonomous coding autoGPT where the error rate doesn't kill it. We're experimenting with design partners on directions here -- think autonomous investigations etc -- but that's more on the advanced fringe, with special use cases, guard rails, etc. For most of our users and use cases, we're able to deliver more reliably -- today -- on the interactive scenarios with smaller snippets.