What does this show that we didn't know already? LLMs cannot provide accurate answers to questions where the data is not included in their training sets. This doesn't appear to have much substance.
LLMs can and will provide inaccurate answers to questions where the data is included in their training sets too; that's in the nature of neural networks. It's just less likely than when the data is not in the training set...
I mean, that's true, but I don't think that's realistically what's going on when one model gives an unqualified "Yes" and the other gives an unqualified "No."
You can argue the study isn't as case-closed-decisive as we'd ideally like, but it's certainly evidence. It's probably hard to design a better study.
What are you talking about? The models were not ALLOWED to have confidence (or the lack thereof). They were explicitly told to give a single label, and in most cases each answer was correct given additional context that the models would surely have provided, especially with access to the internet (which some didn't have). This is just silly.