by SwellJoe 1 hour ago
You're mixing up corpus selection and the benchmark. I probably could have explained that better.

In the benchmark, the models were told to look at the file and were allowed to look at the rest of the repo, with no clues about what to look for.

During selection of which mythos bugs to include, I needed judge models to be able to determine whether contestants found the right bug, since I couldn't realistically judge hundreds of bug reports myself. So the judges were given the bug location and told to identify and explain it.