| I'd be interested to see how this performs against the same benchmark Devin was using. It's hard to deny that this is impressive. But I think there are two interesting points here. First, Claude 3 Opus already scored around 85-86% on these benchmarks without an "AutoDev"-style agentic approach. Second, all the same problems with HumanEval remain: the limitations in what style of problems are chosen, and their real-world relevance. I hate writing these kinds of comments because I'm acutely aware that a part of me is just worried: worried about the speed of progress and worried about a changing landscape. But I still wonder how much of this is going to be transferable to a real-life software context. |
I can see LLMs eating into the expert regime IF they get another 5-10x better. But even in that case, human (expert) knowledge will be required to know what is possible and hence what to ask for (kind of like reward function design in reinforcement learning).