Hacker News
by ultrasaurus 814 days ago
I wonder how well Elo scores handle the edge case where your most important games are against yourself. There are four GPT-4 models in the top 10 (including both #2 and #3) and three Claude models.

(To their credit, they count anything where the 95% confidence intervals overlap as a tie.)
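
For anyone curious what that tie rule looks like in practice, here is a minimal sketch in Python. It assumes each model's rating comes with a closed (low, high) 95% interval. The function name `is_tie` and all of the numbers are made up for illustration; they are not taken from the leaderboard or its code.

```python
def is_tie(ci_a, ci_b):
    """Return True if two closed (low, high) intervals overlap."""
    lo_a, hi_a = ci_a
    lo_b, hi_b = ci_b
    # Two intervals overlap when each one starts before the other ends.
    return lo_a <= hi_b and lo_b <= hi_a


# Hypothetical 95% confidence intervals on rating (illustrative values only).
variant_1 = (1240.0, 1262.0)
variant_2 = (1235.0, 1258.0)
other_model = (1180.0, 1201.0)

print(is_tie(variant_1, variant_2))    # True: the intervals overlap
print(is_tie(variant_1, other_model))  # False: the intervals are disjoint
```

Under a rule like this, two near-identical variants of the same model would usually land in the same tie group, which takes some of the sting out of the "playing against yourself" concern.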