

by gertlabs 21 days ago

Check out the methodology section at the bottom -- we are trying to better convey this information.

1. These numbers are based on percentiles, which inherently can't be saturated. Most benchmarks report something like 0-100% of answers correct, so it's natural to assume the same when you see our numbers. Perhaps we should divide by 100 to avoid that. Our score is a modified percentile against other agents, and it rebalances every time we add new entries. So when a new frontier model comes out, every existing entry it outperforms gets downweighted. And MiMo V2.5 Pro is a much stronger model than people realize.

2. Agents write code to play most of these games (those games account for ~80% of the combined bench score). There is increasing evidence that nearly identical patterns of weights emerge in different models, even when they are trained on different media with different algorithms. Pattern matching and extrapolation don't care if the scenario is a 3D "game" environment or a Salesforce "work RL" environment. Tasks that require drawing distant connections can reward similar circuitry, even across different domains.
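
To make the rebalancing in point 1 concrete, here is a minimal sketch, assuming a plain percentile rank against the other entries. The actual modified score is not spelled out in this comment, so the function name `percentile_scores`, the tie handling, and all model names and raw numbers below are hypothetical.

```python
# Toy illustration only: NOT the benchmark's real scoring formula.
# Assumption: an entry's score is the percentage of *other* entries it beats,
# with ties counted as half a win.

def percentile_scores(raw):
    """Map raw results to a 0-100 percentile rank against the other entries."""
    scores = {}
    for name, value in raw.items():
        others = [v for other, v in raw.items() if other != name]
        if not others:
            # A lone entry has nobody to be ranked against.
            scores[name] = 100.0
            continue
        beaten = sum(1 for v in others if v < value)
        tied = sum(1 for v in others if v == value)
        scores[name] = 100.0 * (beaten + 0.5 * tied) / len(others)
    return scores


# Hypothetical raw results (higher is better).
raw = {"model_a": 0.62, "model_b": 0.71, "model_c": 0.55}
before = percentile_scores(raw)

# A new, stronger entry arrives; nobody's raw result changes.
raw_with_new = dict(raw, new_frontier=0.80)
after = percentile_scores(raw_with_new)

for name in raw:
    print(f"{name}: {before[name]:.1f} -> {after[name]:.1f}")
```

In this toy setup `model_b` goes from the top percentile to roughly two thirds of the way up without its raw result changing, which is why a percentile-style score can't saturate the way a percent-correct score can.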