| HN Mirror



	by kqr 222 days ago
	It would unfortunately also need several runs of each to be reliable. There's nothing in TFA to indicate the results shown aren't to a large degree affected by random chance! (I do think from personal benchmarks that Gemini 3 is better for the reasons stated by the author, but a single run from each is not strong evidence.)

1 comment

TFA says multiple times that the results are affected by random chance

Yes, but recognising that is only the first step. Quantifying the variance is the next step, and that is what I find missing from the article.
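
For what it's worth, here is a rough sketch of what quantifying the variance could look like: run each model several times and report the mean, the sample standard deviation, and the standard error of the mean. Everything in it is an assumption for illustration. The `summarize_runs` helper, the model names, and the scores are made up; none of it comes from TFA.

```python
import math
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation, standard error of the mean)."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores)  # sample standard deviation (n - 1)
    sem = stdev / math.sqrt(len(scores))
    return mean, stdev, sem


# Hypothetical scores from five repeated runs per model (made-up numbers).
runs = {
    "model_a": [0.62, 0.70, 0.66, 0.58, 0.64],
    "model_b": [0.60, 0.69, 0.63, 0.65, 0.61],
}

for name, scores in runs.items():
    mean, stdev, sem = summarize_runs(scores)
    print(f"{name}: mean={mean:.3f} stdev={stdev:.3f} sem={sem:.3f}")
```

If the gap between the two means is small compared with the standard errors, a single run from each model can't tell them apart.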