| The tournament measures cumulative winnings. However, those can be far from the statistical expectation because of the variance of the card distribution in poker. To establish a real winner, you need to play many games: "As seen in the Claudico match (20), even 80,000 games may not be enough to statistically significantly separate players whose skill differs by a considerable margin" [1]. Variance reduction techniques can lower the number of games required [1], but I don't think this is what the website does. To answer the question - "which 'quality' of the LLMs this tournament then actually measures" - since we can't reliably tell who the winner is, I don't think we can make any firm claims about the LLMs. However, it could be interesting to analyze the play from the "psychological profile" perspective of the dark triad (psychopaths / Machiavellians / narcissists).
Essentially, these personality types have been observed to prefer certain strategies, and this preference can be quantified [2]. [1] DeepStack, https://static1.squarespace.com/static/58a75073e6f2e1c1d5b36... [2] Generation of Games for Opponent Model Differentiation, https://arxiv.org/pdf/2311.16781 |
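To give a rough feel for why so many games are needed, here is a back-of-the-envelope sketch. It assumes per-hand winnings are independent with a known standard deviation and uses a normal approximation, which is a simplification of real poker. The function name `hands_needed` and the numbers (an edge of 0.05 big blinds per hand, a standard deviation of 10 big blinds per hand) are made up for illustration and are not taken from [1] or from the tournament site.

```python
import math
from statistics import NormalDist


def hands_needed(edge: float, stdev: float, confidence: float = 0.95) -> int:
    """Hands required before a two-sided confidence interval around the
    observed mean win rate is narrow enough to exclude zero.

    edge  -- assumed true win rate per hand (e.g. in big blinds)
    stdev -- assumed per-hand standard deviation, in the same unit
    """
    if edge <= 0 or stdev <= 0:
        raise ValueError("edge and stdev must be positive")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be strictly between 0 and 1")
    # z-score for the chosen two-sided confidence level
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # The interval half-width z * stdev / sqrt(n) must drop below the edge,
    # so n must exceed (z * stdev / edge) ** 2.
    return math.ceil((z * stdev / edge) ** 2)


# Illustrative (hypothetical) numbers, not measurements.
baseline = hands_needed(edge=0.05, stdev=10.0)
# Halving the standard deviation, as a variance reduction technique might,
# cuts the required number of hands to roughly a quarter.
reduced = hands_needed(edge=0.05, stdev=5.0)
print(baseline, reduced)
```

The required sample size grows with the square of `stdev / edge`, which is why even a modest reduction in variance saves a lot of games, and why a small edge can stay hidden behind card luck for a very long time.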