by brookst
335 days ago
> IMO you can never use an AI agent benchmark that is published on the internet more than once.

This is a long-solved problem that far predates AI. You release 90% of the benchmark publicly and hold back 10% for yourself or closely trusted partners. Performance can then be independently evaluated to check whether scores on the 10% holdback match scores on the public 90%.
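To make the holdback idea concrete, here is a minimal sketch in Python. The function names (`split_benchmark`, `pass_rate`, `holdback_gap`), the fixed seed, and the pass/fail result format are all assumptions for illustration, not any particular benchmark's tooling.

```python
import random


def split_benchmark(task_ids, holdback_fraction=0.10, seed=0):
    """Shuffle task IDs with a fixed seed and return (public, holdback)."""
    ids = sorted(task_ids)          # sort first so the split is reproducible
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_holdback = max(1, round(len(ids) * holdback_fraction))
    return ids[n_holdback:], ids[:n_holdback]


def pass_rate(results, task_ids):
    """Fraction of the given tasks that passed; results maps task ID -> bool."""
    return sum(results[t] for t in task_ids) / len(task_ids)


def holdback_gap(results, public, holdback):
    """Public pass rate minus holdback pass rate."""
    return pass_rate(results, public) - pass_rate(results, holdback)


# Hypothetical example: 100 tasks, 90 published and 10 kept private.
tasks = [f"task-{i}" for i in range(100)]
public, holdback = split_benchmark(tasks)

# Made-up results: every public task passes, only half the holdback tasks do.
results = {t: True for t in public}
results.update({t: i % 2 == 0 for i, t in enumerate(holdback)})

gap = holdback_gap(results, public, holdback)
```

A gap near zero is what you would expect if the model has not seen the public tasks. A large positive gap suggests the public portion leaked into training, though with only 10 holdback tasks the estimate is noisy, so how big a gap counts as suspicious is a judgment call the sketch leaves open.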