Hacker News
by
WhitneyLand
32 days ago
Could this task be a nice benchmark for computer use models?
Would be interesting to see the success rate for Claude Cowork or Codex’s equivalent feature.
1 comment
pulse-dev
32 days ago
Good point, this could be a solid benchmark. The sites are adversarially built to resist automation, and success can be verified later when the records actually disappear, so it would be harder to game than WebArena.