


	by nkko 107 days ago
	FWIW I work at Steel (not the OP). While we’ve been iterating on the “right shape” for agent tooling, I’ve been building a benchmark harness to measure how different surfaces affect real web task completion: raw API context, CLI-only, opinionated “skills” (structured outputs + artifact capture), and combinations. If you’ve run agents on the open web, I’d love suggestions for nasty-but-representative workflows to include in the benchmark.
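	For anyone wondering what comparing surfaces like this might look like in practice, here is a minimal sketch. It is not Steel's harness: the surface names, the `surface_configs` and `completion_rates` helpers, and the `(config, task, succeeded)` result shape are all assumptions made up for illustration. It enumerates each surface alone and in combination, then tallies a per-configuration task completion rate.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical labels for the surfaces named in the comment above;
# these are illustrative, not identifiers from any real harness.
SURFACES = ("raw_api_context", "cli", "skills")


def surface_configs(surfaces=SURFACES):
    """Return every non-empty combination of surfaces, singles first."""
    return [
        combo
        for size in range(1, len(surfaces) + 1)
        for combo in combinations(surfaces, size)
    ]


def completion_rates(results):
    """Aggregate (config, task_name, succeeded) tuples.

    Returns a dict mapping each config to the fraction of its
    attempted tasks that succeeded.
    """
    totals = defaultdict(lambda: [0, 0])  # config -> [successes, attempts]
    for config, _task, succeeded in results:
        totals[config][0] += int(succeeded)
        totals[config][1] += 1
    return {cfg: ok / n for cfg, (ok, n) in totals.items()}
```

	Keeping the configuration as a tuple of surface names makes "combinations" a first-class case rather than a special one, so a single surface and a mix of all three go through the same aggregation path.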