I mean, that's true, but I don't think that's realistically what's going on when one model gives an unqualified "yes" and the other gives an unqualified "no."
You can argue the study isn't as case-closed-decisive as we'd ideally like, but it's certainly evidence. It's probably hard to design a better study.
What are you talking about? The models were not ALLOWED to have confidence (or the lack thereof). They were explicitly told to give a single label, and in most cases, any of their answers could be correct depending on additional context they would surely have provided, especially with access to the internet (which some didn't have). This is just silly.