Fascinating, thanks for calling that out: I found 1.0 promising in practice, but with hallucination problems. Then I saw it had gotten 57% of questions wrong on open-book true/false and I wrote it off completely - no reason to switch to it for speed and cost if it's just a random generator. That's a great outcome.
Speaking of which, I wonder how they'd do on SimpleQA. OpenAI is a negative outlier there compared with Anthropic. This benchmark also deals with hallucination and "inappropriate certainty".