There are actual SWE jobs where humans sift through this kind of noise. Someone told me they worked such a job recently.
It's a good tool to add pressure and raise expectations.
Maybe this is the future...
They only know the 22% number because the benchmark includes unit tests that check for a fix. In other words, in a real-world situation, a human would still need to double-check each patch. The patches this tool generates do not include appropriate tests or explanations and would never pass code review by a qualified human.