
To repeat, not a dig at FrontierCode, which is substantial progress in benchmarking. But I'd argue modeling the rest of the process is tha(aaa)t valuable and becomes more so as coding capability progresses:

Async agents interact on a longer timescale, but they interact. Again, experienced SWEs, consulting agencies, etc. ask questions before and after implementation, accept notes, etc.; they vary in how good they are at it; and how well they do it is a big factor in the success and failure of projects.

LLM interaction ability isn't saturated or mature; asking for point edits mostly works, but e.g. when I try to get Opus to ask clarifying questions or surface tricky bits to focus review, it's not close to a human-level response -- it's both noisy and misses key stuff. (Handling uncertainty has been a weak point for LLMs since early on, which might not help.) Other aspects of good interaction are even harder, like digging into a potentially mistaken request, or proposing a good 80/20 tweak to the spec.

There's a different, shorter-term reason to model interaction: it better tells users the value to expect now. It turns out my employer doesn't love infinite Opus use. (Go figure.) Kimi and Sonnet do comparably on FrontierCode. Are they about the same to use, or is one flailing while the other one just needs a couple rounds of fixups? If I saw a benchmark that credibly approximated 'this model will save you this much time vs. that one' that would put it well above existing ones.

I do think a bunch of discussion, investment, etc. is based on the idea that the industry will essentially be replaced with successful one-shotting with little interaction. The mistake there is to assume back-and-forth is inessential and only happens because the agents aren't that good at coding yet. For a long time, lots of back-and-forth was driven by the models' limitations at raw coding, which might've made that idea more appealing.

As the coding side gets better, drawing the rest of the owl becomes the hard part. The world is messy and so is one's software's boundary with it. (I'm not saying the tasks don't get longer, I'm saying interaction gets more important as they do.) My conviction here might partly be because in my sort of work the requirements and big picture were always thornier than typing the code; I suspect that as raw coding gets easier for everybody they will hit something analogous.

Anyway, again, what y'all are doing is progress. I do want to stick up for the idea that a lot of critical things aren't raw coding ability. (I'm not alone in that, I don't think!) I'm definitely not here to say someone's Doing It Wrong when they're doing it more correctly than I've seen it done--just asking "would the patch get accepted?" is a huge step.