by twotwotwo 11 days ago
I'm liking the effort to make new, no-longer-saturated benchmarks. I'll also be a bit suspicious if some model aces it -- matching OSS maintainers' taste more often is a plausible improvement in quality but if they nail it every time they've been memorizing.

Not saying FrontierCode should've done this, but benchmarking the interaction would be interesting. That is, if I get a diff with a blocking problem but writing a comment gets it fixed, that's a lot different from if the model has hit a wall. Better, if there's a problem but the model flagged it in a short list of questions or worries to me before or after coding, it can get sorted without taking much of my time. Stick an LLM in the loop instructed to behave like a user or reviewer with some rubric-ish info that wasn't in the prompt. Then, look at how much the pretend user has to do to get to a quality result with a given model, if they can get to one at all.
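To make that loop concrete, here's a rough sketch of the shape I have in mind. Everything in it is made up for illustration: `run_episode`, `make_reviewer`, and the two toy agents are hypothetical names, the "reviewer" is a substring check standing in for an LLM with a hidden rubric, and none of it reflects how FrontierCode actually works. The number to watch is how many comments the pretend reviewer has to write before the patch passes.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical types: an agent maps (task, feedback so far) to a patch,
# and a reviewer returns a comment, or None when it would accept the patch.
Agent = Callable[[str, List[str]], str]
Reviewer = Callable[[str], Optional[str]]


@dataclass
class Result:
    accepted: bool
    rounds: int  # review comments written before the episode ended


def run_episode(task: str, agent: Agent, reviewer: Reviewer,
                max_rounds: int = 3) -> Result:
    """Let the agent revise until the reviewer accepts or the budget runs out."""
    feedback: List[str] = []
    while True:
        patch = agent(task, feedback)
        comment = reviewer(patch)
        if comment is None:
            return Result(accepted=True, rounds=len(feedback))
        if len(feedback) == max_rounds:
            return Result(accepted=False, rounds=len(feedback))
        feedback.append(comment)


def make_reviewer(rubric: List[str]) -> Reviewer:
    """Toy stand-in for an LLM reviewer holding rubric info the prompt lacked."""
    def reviewer(patch: str) -> Optional[str]:
        for item in rubric:
            if item not in patch:
                return f"missing: {item}"
        return None
    return reviewer


def quick_agent(task: str, feedback: List[str]) -> str:
    # Toy agent that addresses every comment it has received.
    fixes = [c.split(": ", 1)[1] for c in feedback]
    return " ".join([task] + fixes)


def stuck_agent(task: str, feedback: List[str]) -> str:
    # Toy agent that ignores feedback, i.e. it has hit a wall.
    return task


if __name__ == "__main__":
    reviewer = make_reviewer(["tests", "docs"])
    print(run_episode("add retry", quick_agent, reviewer))
    print(run_episode("add retry", stuck_agent, reviewer))
```

With the toy rubric, the first agent gets accepted after two comments and the second burns the whole budget without getting there, which is the distinction a pass/fail score on the first diff can't show.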

You could say 'why worry about interaction? the goal is the model just gets it perfect' but I think that imagined end state just is not a thing: tasks will get bigger but there will still be interaction. Handling comments and asking good clarifying questions when needed are real capabilities. Human SWEs interact plenty and real engineering has a certain density of questions about requirements, taste, and other big vague things.


i agree it would be interesting but apart from the fact that it'd be harder to measure and automate, theres real alpha in being the best truly async, hands off model/agent, which is what cog has been working on for 2 years now. its not that im opposed to steering or interaction mid task, its just that 1) it mostly Just Works, 2) it doesnt parallelize/scale well, 3) including on proactive agents (https://docs.devin.ai/product-guides/automations).

see my “semi async valley of death” post. people are pursuing both sides but per bitter lesson only one side scales indefinitely with compute

that said, multistage rollouts and synthetic rubrics (using grpo advantage? see dr tulu paper) somewhat approximate human intervention and interaction, so theres known ways to model that, its just not thaaaat valuable

To repeat, not a dig at FrontierCode, which is substantial progress in benchmarking. But I'd argue modeling the rest of the process is tha(aaa)t valuable and becomes more so as coding capability progresses:

Async agents interact on a longer timescale, but they interact. Again, experienced SWEs, consulting agencies, etc. ask questions before and after implementation, accept notes, etc.; they vary in how good they are at it; and how well they do it is a big factor in the success and failure of projects.

LLM interaction ability isn't saturated or mature; asking for point edits mostly works, but e.g. when I try to get Opus to ask clarifying questions or surface tricky bits to focus review, it's not close to a human-level response -- it's both noisy and misses key stuff. (Handling uncertainty has been a weak point for LLMs since early on, which might not help.) Other aspects of good interaction are even harder, like digging into a potentially mistaken request, or proposing a good 80/20 tweak to the spec.

There's a different, shorter-term reason to model interaction: it better tells users the value to expect now. It turns out my employer doesn't love infinite Opus use. (Go figure.) Kimi and Sonnet do comparably on FrontierCode. Are they about the same to use, or is one flailing while the other one just needs a couple rounds of fixups? If I saw a benchmark that credibly approximated 'this model will save you this much time vs. that one' that would put it well above existing ones.

I do think a bunch of discussion, investment, etc. is based on the idea the industry will essentially be replaced with successful one-shotting with little interaction. The mistake there is to assume back-and-forth is inessential and only happens because the agents aren't that good at coding yet. For a long time lots of back-and-forths were driven by the models' limitations at raw coding, which might've made that idea more appealing.

As the coding side gets better, drawing the rest of the owl becomes the hard part. The world is messy and so is one's software's boundary with it. (I'm not saying the tasks don't get longer, I'm saying interaction gets more important as they do.) My conviction here might partly be because in my sort of work the requirements and big picture were always thornier than typing the code; I'm suspicious that as raw coding gets easier for everybody they will hit something analogous.

Anyway, again, what y'all are doing is progress. I do want to stick up for the idea that a lot of critical things aren't raw coding ability. (I'm not alone in that, I don't think!) I'm definitely not here to say someone's Doing It Wrong as they do it more correctly than I've seen it done--just asking "would the patch get accepted?" is a huge step.

no disagreements. big fan of thinky