Hacker News
by m101 5 hours ago
I've been building an auction site and using an AI swarm to test it: sellers, intermediaries, buyers, market practices/norms, etc. I mostly used GPT 5.5 xhigh to code up the scenario, and looped over it with Opus 4.8 as a checker.

Out of curiosity I asked Fable to review it all and I was shocked to find that there were a lot of blindingly obvious common sense mistakes that got through, for example:

- all intermediaries were given the prices of all buyers up front

- private price information in certain auction types was actually being broadcast to everyone

- multiple contradictions in instructions

If it had been just one of these things I might have understood, but the fact that so many got past both Opus and GPT 5.5 makes me think that Fable has something special. This is a common-sense kind of thing that I think you only notice when your task doesn't involve a measurable metric, but rather some sort of fuzzy real-world task.

There's clearly a problem with all these measures of performance when the difference between these models was night and day in my specific task.

3 comments

Maybe you are something special for letting those slip through in the first place?
GP literally caught them?
Prompt: can you reformat your sentence to be less unkind?
This seems like the exact project you should try out Codex Security for. It catches a lot of stuff:

https://chatgpt.com/codex/cloud/security/

> ... and I was shocked to find that there were a lot of blindingly obvious common sense mistakes that got through

Wait... Are you telling me the models that, up until just a month ago, everybody told me were better than coders are actually making lots of mistakes?

This is shocking.