by lmeyerov
71 days ago
Evals let us agree on the baseline, the measurement, and so on, and check whether the simple things others do perform just as well. For the same reason, instead of 'works on my box' and 'my coding style', use one of the many community evals rather than making up your own benchmark. That helps head off many of the unfalsifiable discussions and claims happening and moves everyone forward.
Natively, Claude (and other LLMs) will resolve conflicting claims at about a 51% rate (based on internal research).
The built-in Byzantine fault tolerance (again, in the compiler) is also pretty remarkable: it can correctly find the right answer even if 93% of the agents/data are malicious (with only 7% of agents/data telling us the correct information).
Basically, the idea here is that if you want to build autonomous agents at scale, you need to be able to resolve disagreement at scale, and this project does a pretty nice job of that.