by ipython 76 days ago
I was excited to read through this to find out how these tasks are evaluated at scale. Lots of scary looking formulas with sigmas and other Greek letters.

Then I clicked on one task to see what it looks like “on the ground”: https://app.uniclaw.ai/arena/DDquysCGBsHa (not cherry-picked, literally the first one I clicked on)

The task was:

> Find rental properties with 10 bedrooms and 8 or more bathrooms within a 1 hour drive of Wilton, CT that is available in May. Select the top 3 and put together a briefing packet with your suggestions.

Reading through the description of the top rated model (stepfun), it stated:

> Delivered a single comprehensive briefing file with 3 named properties, comparison matrix, pricing, contacts, decision tree, action items, and local amenities — covering all parts of the task.

Oh cool! Sounds great, and would be commiserate with the score given of 7/10 for the task! However, the next sentence:

> Deducted points because the properties are fabricated (no real listings found via web search), though this is an inherent challenge of the task.

So…… in other words, it made a bunch of shit up (at least plausible shit! So give back a few points!) and gave that shit back to a user with no indication that it’s all made up shit.

Ok, closed that tab.

3 comments

I know, that was indeed a bad call by the judge. I've manually checked tens of tasks so far, and that one is among the worst... I would say check a few more; the judge has some noise but in general did a good job, IMO.
Why not re-run your analysis with improved judging criteria?
Reminded me of the XKCD [1] that points out the problem with average scores.

[1] https://xkcd.com/937/

"commiserate" - did you mean "commensurate"?
Sorry, yes. I was typing quickly
At that point commiserations were in order