by binarymax 79 days ago
I disagree that evaluation is always a coding task. Evaluation is scrutiny for the person who wants the thing. It’s subjective. So, unless you’re evaluating something purely objective, such as an algorithm, I don’t see how a self-contained, self-“improving” agent satisfies the subjectivity constraint, since by design you are leaving out the subject.

Sure. There will always be subjective tasks where the person who asks for something needs to give feedback. But even there we could come up with ways to make it easier, faster, or better UX. (One example I saw from my frontend colleagues: they use a fast model to create 9 versions of a component in a grid, decide "at a glance" which one is "better", and use that going forward.)

OTOH, there's loads you can do for evaluation before a human even sees the artifact. Things like does the site load, does it behave the same, did anything major change on the happy path, etc etc. There's a recent-ish paper where instead of classic "LLM as a judge" they used LLMs to come up with rubrics, and other instances check original prompt + rubrics on a binary scale. Saw improvements in a lot of evaluations.
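
A rough sketch of the rubric idea, not the paper's actual method: each rubric item is checked as a yes/no question and the results are aggregated into a pass rate. Everything here is an assumption for illustration: the `grade` function, the `Judge` signature, and `keyword_judge`, which is a toy stand-in for what would really be an LLM call.

```python
from typing import Callable, Dict, List

# Hypothetical judge signature: (prompt, rubric_item, output) -> True/False.
# In a real setup this would wrap an LLM call; here it is a plain function.
Judge = Callable[[str, str, str], bool]


def grade(prompt: str, output: str, rubric: List[str], judge: Judge) -> Dict[str, object]:
    """Check every rubric item on a binary scale and report the pass rate."""
    results = {item: bool(judge(prompt, item, output)) for item in rubric}
    passed = sum(results.values())
    return {
        "results": results,
        "passed": passed,
        "total": len(rubric),
        "score": passed / len(rubric) if rubric else 0.0,
    }


def keyword_judge(prompt: str, rubric_item: str, output: str) -> bool:
    """Toy stand-in for an LLM judge: passes if the rubric text appears in the output."""
    return rubric_item.lower() in output.lower()


if __name__ == "__main__":
    report = grade(
        prompt="Write install docs",
        output="Run pip install foo, then see the usage section.",
        rubric=["pip install", "usage", "license"],
        judge=keyword_judge,
    )
    print(report["passed"], "of", report["total"], "rubric items passed")
```

The binary scale is the point: a yes/no per item is easier to audit than a single 1-10 score, and the failing items tell you what to fix.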

Then there's "evaluate by having an agent do it" for any documentation tracking. Say you have a project, you implement a feature, and document the changes. Then you can have an agent take that documentation and "try it out". Should give you much faster feedback loops.

> Things like does the site load, does it behave the same, did anything major change on the happy path, etc etc.

I asked Claude to build a web app to run locally, polling data from the LAN. It fought me for four rounds of me telling it that the data from the API wasn’t rendered on the page. It created tests with mock data, it validated the API, it tested that the page loaded. It was gaslighting me, telling me that everything worked every time I told it that it didn’t. I had to tell it to inspect the DOM and take screenshots with Playwright to make it stop effing around. I don’t think it ever would have found the right response on its own.

Even after deliberate intervention, it regressed a few rounds later and stopped caring that tests failed. Whatever, I don’t treat it as anything more than a sometimes-correct random output machine.
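
For what it's worth, the failure described above ("API works, page loads, data not on the page") can be turned into a hard check. This is only a sketch under assumptions: `missing_from_page` is a made-up helper, and the HTML is assumed to have been captured from a real browser after rendering (for example with Playwright's `page.content()`), not fetched as the raw server response. It only looks at text nodes, so it knows nothing about CSS visibility, and a value split across tags would be missed.

```python
from html.parser import HTMLParser
from typing import Iterable, List


class _TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def missing_from_page(rendered_html: str, expected_values: Iterable[str]) -> List[str]:
    """Return the expected values (e.g. fields from the API response)
    that do not appear anywhere in the page's text."""
    parser = _TextCollector()
    parser.feed(rendered_html)
    parser.close()
    # Normalize whitespace so line breaks in the markup don't hide a match.
    text = " ".join(" ".join(parser.chunks).split())
    return [value for value in expected_values if value not in text]


if __name__ == "__main__":
    # The value exists only inside a script blob: the page "loads" but shows nothing.
    broken = '<body><script>var d = {"temp": "21.5"};</script><div id="app"></div></body>'
    fixed = '<body><div id="app"><span>21.5</span></div></body>'
    print(missing_from_page(broken, ["21.5"]))
    print(missing_from_page(fixed, ["21.5"]))
```

Skipping `<script>` contents matters here: a value sitting in an embedded JSON blob would otherwise count as "on the page" even though nothing rendered it.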

The thing you're missing is harness engineering.
In science there are ways to turn subjective things (which cannot be counted directly) into observable, quantified phenomena. Take opinion polls for instance: "approval" of a political figure can mean many things and is subjective, but experts in the field make "approval" into a number through scientific methods. These methods are only approximations and come with many ifs; they're not perfect (and for presidential campaign analysis in particular they've been failing for reasons I won't clarify here), but they're useful nonetheless.
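
As a minimal worked example of turning "approval" into a number with an explicit error bar, here is the textbook normal-approximation margin of error for a proportion. It assumes a simple random sample, which real polls rarely have, and the counts are made up purely for illustration; `approval_estimate` is a hypothetical helper name.

```python
import math
from typing import Tuple


def approval_estimate(approve: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Return (proportion, margin of error) using the normal approximation.

    z = 1.96 corresponds to a ~95% confidence level.
    """
    if total <= 0 or not 0 <= approve <= total:
        raise ValueError("need 0 <= approve <= total and total > 0")
    p = approve / total
    margin = z * math.sqrt(p * (1 - p) / total)
    return p, margin


if __name__ == "__main__":
    # Made-up counts: 520 of 1000 respondents say they approve.
    p, margin = approval_estimate(520, 1000)
    print(f"approval: {p:.1%} +/- {margin:.1%}")
```

The same arithmetic applies to any yes/no human judgment collected at scale, which is what makes a subjective signal usable inside an evaluation loop.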

Another thing that gets quantified is video preferences, in order to maximize engagement.