|
|
|
|
|
by Oxlamarr
54 days ago
|
|
The harness point is probably the most important part here. With terminal agents, it often feels like the benchmark is measuring the whole loop: model, prompt, tool interface, retry policy, timeout handling, and recovery from bad shell commands. Do you have a sense of which part contributed most to the jump? |
|