Could you elaborate in specifics how you had been underestimating models? Ypu mean just using more tighter harnessing to make them work in structured agentic eay or something else?
The specific code I was working on, I had a general idea of the sort of performance improvement that would be possible. I just thought that it would be too hard for the models to figure out without a lot of hand-holding.
But it ended up being not "too hard ever", but more like, in 1 out of every 5 tries, the model did in fact manage to get a large refactoring to the point where it improved performance. So once I set it up to try something, use the perf test, see if it worked, if not, throw it away, repeat. Then it started, slowly, finding some useful things.
Just remember that the will do clever but useless things to improve. Like changing the random seed as per autoresearch's hero image. lol! imo, out of the box thinking is needed.
But it ended up being not "too hard ever", but more like, in 1 out of every 5 tries, the model did in fact manage to get a large refactoring to the point where it improved performance. So once I set it up to try something, use the perf test, see if it worked, if not, throw it away, repeat. Then it started, slowly, finding some useful things.