by nkko
107 days ago
FWIW I work at Steel (not the OP). While we’ve been iterating on the “right shape” for agent tooling, I’ve been building a benchmark harness to measure how different surfaces affect real web task completion: raw API context, CLI-only, opinionated “skills” (structured outputs + artifact capture), and combinations. If you’ve run agents on the open web, I’d love suggestions for nasty-but-representative workflows to include in the benchmark.