by mpavlov
235 days ago
(author of PokerBattle here) That’s true.
The original goal was to see which model performs statistically better than the others, but I quickly realized that would be neither practical nor particularly entertaining. A proper benchmark would require things like:
- Tens of thousands of hands played
- Strict heads-up format (only two models compared at a time)
- Each hand played twice with positions swapped

The current setup is mainly useful for observing common reasoning failure modes and how often they occur.
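To illustrate why the three requirements above matter, here is a rough sketch of a toy simulation. It is not PokerBattle code. The function names (`simulate_deal`, `summarize`, `compare`) and every number in it (the 0.05 bb edge, the luck and decision noise levels) are made-up assumptions chosen only to show the effect: when the same deal is played twice with seats swapped, most of the card luck cancels out, so the win-rate estimate gets much tighter for the same number of deals.

```python
import math
import random
import statistics


def simulate_deal(rng, edge_bb=0.05, luck_sd=10.0, decision_sd=2.0):
    """Toy model of one heads-up deal played twice with seats swapped.

    Returns model A's winnings in big blinds for both plays.
    All parameters are hypothetical, not measured from real games.
    """
    luck = rng.gauss(0.0, luck_sd)  # card luck that favors seat 1 on this deal
    first = edge_bb + luck + rng.gauss(0.0, decision_sd)   # A sits in seat 1
    second = edge_bb - luck + rng.gauss(0.0, decision_sd)  # A sits in seat 2, luck flips
    return first, second


def summarize(samples):
    """Return (mean, standard error of the mean) for a list of results."""
    mean = statistics.fmean(samples)
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, sem


def compare(n_deals=20000, seed=0):
    """Compare a single-play estimate with a swapped-position estimate."""
    rng = random.Random(seed)
    deals = [simulate_deal(rng) for _ in range(n_deals)]
    unpaired = [first for first, _ in deals]                    # each deal played once
    paired = [(first + second) / 2 for first, second in deals]  # averaged over both seats
    return summarize(unpaired), summarize(paired)


if __name__ == "__main__":
    (u_mean, u_sem), (p_mean, p_sem) = compare()
    print(f"single play:    {u_mean:+.3f} bb/hand, standard error {u_sem:.3f}")
    print(f"swapped replay: {p_mean:+.3f} bb/hand, standard error {p_sem:.3f}")
```

Under these assumptions the single-play standard error stays larger than the assumed edge even after 20,000 deals, while the swapped-position version comes out several times smaller. Real poker results are not Gaussian and luck does not cancel this cleanly, so treat it as intuition for the sample sizes involved, not as an estimate for any actual model.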