by 3abiton 911 days ago
Thanks for the reference. I was searching for a benchmark that can quantify the typical user experience, as most synthetic ones are completely ineffective. At what sample size does the ranking become significant? Or is that baked into the metric (Elo)?
1 comment

Elo converges on stable scores fairly quickly, depending on the K-factor. I wouldn't think it would be much of an issue at all for something like this, since you can ensure you're testing against every other member (avoiding "Elo islands"). But obviously the more trials the better.
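
For anyone who wants to see how the K-factor enters, here is a rough sketch of the standard Elo update plus a round-robin pass so every member meets every other one (no "Elo islands"). The function names, the 1500 starting rating, and K=32 are my own illustrative choices, not anything taken from the benchmark being discussed.

```python
from itertools import combinations


def expected_score(r_a, r_b):
    """Elo's predicted score for A against B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a, r_b, score_a, k=32.0):
    """Return updated (r_a, r_b) after one game.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss.
    k is the K-factor: larger values move ratings faster but noisier.
    """
    delta = k * (score_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


def round_robin(ratings, outcome, k=32.0):
    """Play every pair exactly once so no group is rated in isolation.

    outcome(a, b) is a hypothetical callback returning a's score vs b.
    """
    ratings = dict(ratings)  # work on a copy
    for a, b in combinations(sorted(ratings), 2):
        ratings[a], ratings[b] = elo_update(
            ratings[a], ratings[b], outcome(a, b), k
        )
    return ratings
```

With two equally rated members and K=32, a single win moves the winner up 16 points and the loser down 16, so the total is conserved. A smaller K settles more slowly but bounces around less once it gets there.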

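The Glicko rating system is very similar to Elo, but it also models the variance of a given rating. It can directly tell you a "rating deviation," which is the number to watch if you want to know whether two ratings are far enough apart to be significant.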
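
To make the "rating deviation" concrete, here is a minimal sketch of a single Glicko-1 rating-period update as I understand the published formulas. It leaves out the step that inflates RD over periods of inactivity, and the function and variable names are just mine.

```python
import math

Q = math.log(10) / 400.0  # Glicko scaling constant


def g(rd):
    """Discount factor: opponents with uncertain ratings count for less."""
    return 1.0 / math.sqrt(1.0 + 3.0 * (Q * rd) ** 2 / math.pi ** 2)


def glicko_update(r, rd, results):
    """Update (rating, RD) from one rating period.

    results is a list of (opp_rating, opp_rd, score) tuples,
    with score 1.0 for a win, 0.5 for a draw, 0.0 for a loss.
    """
    inv_d2 = 0.0   # accumulates 1 / d^2 (information gained from the games)
    surprise = 0.0  # accumulates g * (actual - expected)
    for opp_r, opp_rd, score in results:
        g_j = g(opp_rd)
        e_j = 1.0 / (1.0 + 10 ** (-g_j * (r - opp_r) / 400.0))
        inv_d2 += (Q * g_j) ** 2 * e_j * (1.0 - e_j)
        surprise += g_j * (score - e_j)
    precision = 1.0 / rd ** 2 + inv_d2
    new_r = r + (Q / precision) * surprise
    new_rd = math.sqrt(1.0 / precision)
    return new_r, new_rd


# One win against a well-measured opponent, two losses against less certain ones.
rating, deviation = glicko_update(
    1500.0, 200.0,
    [(1400.0, 30.0, 1.0), (1550.0, 100.0, 0.0), (1700.0, 300.0, 0.0)],
)
```

Working that through by hand, the rating should land near 1464 with the deviation shrinking from 200 to roughly 151. A common rule of thumb is to treat rating ± 2·RD as an approximate 95% interval, so two entries whose intervals overlap heavily probably haven't been separated by the data yet.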