by timabdulla
511 days ago
Those numbers are not the full story. Note that GP specifically says: "Big jumps in benchmarks from _Claude's Computer Use_ though." Claude Computer Use was not SOTA for browser tasks at the time of its release (and still is not). In WebArena, Operator scores 58.1%; the previous SOTA for browser-use agents is 57.1%.
In WebVoyager, Operator scores 87.0%, exactly matching the previous SOTA for browser-use agents. See here for details: https://openai.com/index/computer-using-agent/