
Like you said, there's a lot of complexity in the decision making here. To get statistically significant results, we need to run these simulations many times. We record latency, tool calls, token consumption, etc., as well as results. Since we log the actions and their final outcomes, we can later analyze how individual decisions correlate with success. Our hypothesis is that games provide an important benchmark for how these models' intelligence will adapt as they become more capable.
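To make the "analyze decisions against outcomes" part concrete, here's a rough sketch of what that kind of post-hoc analysis could look like. The record shape (`actions`, `success`) and the action names are made up for illustration and are not our actual log schema. It just compares the success rate of runs that used a given action against runs that didn't.

```python
from collections import defaultdict

def success_rate_by_action(runs):
    """Return {action: (rate_with, rate_without)} over logged runs.

    Each run is a dict with hypothetical keys:
      "actions": list of action names taken during the run
      "success": bool, whether the run ended in a win
    """
    # Per action: [wins_with, n_with, wins_without, n_without]
    counts = defaultdict(lambda: [0, 0, 0, 0])
    all_actions = {a for run in runs for a in run["actions"]}

    for run in runs:
        used = set(run["actions"])
        won = 1 if run["success"] else 0
        for action in all_actions:
            c = counts[action]
            if action in used:
                c[0] += won
                c[1] += 1
            else:
                c[2] += won
                c[3] += 1

    # A rate is None when there are no runs in that group.
    return {
        a: (c[0] / c[1] if c[1] else None, c[2] / c[3] if c[3] else None)
        for a, c in counts.items()
    }

# Toy data, purely illustrative.
runs = [
    {"actions": ["scout", "attack"], "success": True},
    {"actions": ["attack"], "success": False},
    {"actions": ["scout", "defend"], "success": True},
    {"actions": ["defend"], "success": False},
]
rates = success_rate_by_action(runs)
```

A gap between the two rates is only a correlation, and with a handful of runs it's mostly noise, which is why the simulations need to be repeated many times before reading anything into it.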

For example, I'm sure an RL bot could figure out an optimal strategy over millions of simulations that defeats current LLMs with context. However, this may not always hold true.