This was also my first thought, but reading [1] again, what they did was have annotators label things like:

> Whether we consider the issue description to be underspecified and hence unfair to be testing on.

> Whether the FAIL_TO_PASS unit tests filter out valid solutions

and a bit more. This is pointed out in the linked paper too.

The moral of the story, to me, is: don't believe the paid human annotators. You can (hopefully) still believe the PhD students doing this work unpaid as their research ;-)

[1] https://openai.com/index/introducing-swe-bench-verified/