by overfeed 40 days ago
> What kind of products/services are you building where you aren't able to tie your eval suite to business value?

There are no evals in my org that can quantify the value of a proposed feature, rank it against the support issues that keep popping up, or know when to stop expending effort because no solution has been found or too many unknowns have cropped up. We still rely on natural intelligence for that, and haven't YOLO'd (ha) on independent agents. I'd rather quit than spend my day herding agents and have my job reduced to that of a code-review monkey.

Benchmark evals are at least 3 degrees removed from actual business value - maybe less if your tasks are repetitive. None of the harnesses I've used have any sense of a compute budget beyond Boolean thinking/no-thinking modes.