by pmoxyz 131 days ago
Using a complex environment like Freeciv as your vehicle for benchmarking is impressive, but it also means you have a lot of confounding variables at play. How do you extract meaningful capability insight from that, as opposed to simpler benchmarks like MMLU or GSM8K?

Like you said, there's a lot of complexity in the decision making here. To get statistically significant results we need to run these simulations many times. We record latency, tool calls, token consumption, etc., as well as results. Since we log the actions and their final outcomes, we can later analyze how individual decisions correlate with success. Our hypothesis is that games provide an important benchmark for how these models' intelligence will adapt as they become more capable.

For example, I'm sure an RL bot will be able to figure out an optimal strategy over millions of simulations that defeats current LLMs with context. However, this may not always hold true.
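
The log-then-analyze step described above could look roughly like the sketch below. It is only an illustration under stated assumptions: the record fields (`tool_calls`, `won`), the sample values, and the helper names are all hypothetical, not the project's actual log schema or analysis code. It puts a confidence interval on the win rate across repeated games and checks how one logged metric correlates with winning.

```python
import math
from statistics import mean


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a win rate over n games (z=1.96 is ~95%)."""
    if n <= 0:
        raise ValueError("need at least one game")
    p = wins / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def pearson(xs, ys):
    """Pearson correlation; with a 0/1 outcome this is the point-biserial r."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when either series is constant
    return cov / math.sqrt(vx * vy)


# Made-up records, one per simulated game (hypothetical schema and values).
games = [
    {"tool_calls": 40, "won": 0},
    {"tool_calls": 55, "won": 0},
    {"tool_calls": 90, "won": 1},
    {"tool_calls": 120, "won": 0},
    {"tool_calls": 130, "won": 1},
    {"tool_calls": 150, "won": 1},
]

wins = sum(g["won"] for g in games)
low, high = wilson_interval(wins, len(games))
r = pearson([g["tool_calls"] for g in games], [g["won"] for g in games])

print(f"win rate {wins / len(games):.2f}, interval [{low:.2f}, {high:.2f}]")
print(f"correlation between tool calls and winning: {r:.2f}")
```

With only a handful of games the interval is very wide, which is the point about needing many runs. A correlation like this also says nothing about causation, so confounders such as map seed or opponent strength would still need to be controlled for separately.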