by michalsustr 237 days ago
The tournament ranks players by cumulative winnings. However, those can be far from each player's expected winnings, because the luck of the deal adds a lot of variance in poker.

To establish a real winner, you need to play many games:

> As seen in the Claudico match (20), even 80,000 games may not be enough to statistically significantly separate players whose skill differs by a considerable margin [1]

It is possible to reduce the number of required games thanks to variance reduction techniques [1], but I don't think this is what the website does.
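To give a feel for why the number of games gets so large, here is a rough back-of-the-envelope sketch. It assumes per-game winnings are independent and that their average is approximately normal, and it asks how many games are needed before a two-sided confidence interval around the observed edge excludes zero. The function name `games_needed` and the numbers (an edge of 0.05 big blinds per game, a standard deviation of 10 or 5 big blinds per game) are made up for illustration; they are not taken from the tournament or from [1].

```python
import math
from statistics import NormalDist


def games_needed(edge, stdev, confidence=0.95):
    """Games needed so that a two-sided confidence interval around the
    mean per-game winnings excludes zero (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Half-width of the interval is z * stdev / sqrt(n); solve for n
    # such that the half-width equals the true edge.
    return math.ceil((z * stdev / edge) ** 2)


# Hypothetical values, in big blinds per game.
baseline = games_needed(edge=0.05, stdev=10.0)
# Same edge, but with the per-game standard deviation halved,
# as a stand-in for what a variance reduction technique buys you.
reduced = games_needed(edge=0.05, stdev=5.0)

print(baseline, reduced)
```

With these made-up numbers the baseline comes out at roughly 150,000 games, and halving the standard deviation cuts that to about a quarter, since the required number of games scales with the square of the standard deviation. That quadratic scaling is why variance reduction matters so much here.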

To answer the question - "which 'quality' of the LLMs this tournament then actually measures" - since we can't reliably tell who the winner is, I don't think we can make any particular claims about the LLMs.

However, it could be interesting to analyze the play from a "psychological profile" perspective, using the dark triad (psychopaths / Machiavellians / narcissists). Essentially, these personality types have been observed to prefer certain strategies, and this can be quantified [2].

[1] DeepStack, https://static1.squarespace.com/static/58a75073e6f2e1c1d5b36...

[2] Generation of Games for Opponent Model Differentiation, https://arxiv.org/pdf/2311.16781