by gertlabs
62 days ago
In our benchmarks we exclusively use a custom harness for measuring tool capability. It has the common tools any harness would have, such as a thin wrapper around shell commands and basic file editors, but an important part of agentic intelligence is adapting to new tools. Frontier models are already quite adaptable, Anthropic models especially, and they improve with each release. I think a standardized format will become less and less important over time. Benchmarks at https://gertlabs.com
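For readers unfamiliar with the term, here is a rough sketch of what a "thin wrapper around shell commands" tool might look like. This is purely illustrative and not the actual harness: it assumes a Python harness, and the name `run_shell`, the `timeout_s` parameter, and the returned dictionary shape are all hypothetical.

```python
import subprocess


def run_shell(command: str, timeout_s: float = 30.0) -> dict:
    """Hypothetical tool: run one shell command and return a plain result dict.

    The dict shape (exit_code / stdout / stderr) is an assumption made for
    this sketch, not a format taken from any real harness.
    """
    try:
        proc = subprocess.run(
            command,
            shell=True,           # hand the string to the system shell as-is
            capture_output=True,  # collect stdout and stderr instead of printing
            text=True,            # decode bytes to str
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # Report the timeout as data so the model can react to it.
        return {
            "exit_code": None,
            "stdout": "",
            "stderr": f"timed out after {timeout_s}s",
        }
    return {
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }
```

The wrapper adds nothing beyond capturing output and reporting failures as data, which is the sense of "thin" here: the model sees roughly what a person at a terminal would see, whatever tool-call format sits on top.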
|
The only case where a standard wouldn't win is one where models can only support their baked-in format, but even that could be solved by adopting a standard format.