by pmoxyz
131 days ago
Using a complex environment like Freeciv as your vehicle for benchmarking is impressive, but it also means you have a lot of confounding variables at play. How do you extract meaningful capability insight from that, as opposed to simpler benchmarks like MMLU or GSM8K?

For example, I'm sure an RL bot could figure out an optimal strategy over millions of simulations that defeats current LLMs with context, though this may not always hold true.