Hacker News
by
Retr0id
102 days ago
> As proof, ABP with opus 4.6 as the driver scores 90.5% on the Online Mind2Web benchmark
And what does Opus score with "regular" browser harnesses?
2 comments
esafak
102 days ago
https://huggingface.co/spaces/osunlp/Online_Mind2Web_Leaderb...
Retr0id
102 days ago
Hm, I can't see Opus 4.6 on there.
theredsix
102 days ago
I tweeted at OSUNLP, and they're backed up on eval validation. In the meantime, here's the benchmark repo with the saved runs and instructions for running it locally.
https://github.com/theredsix/abp-online-mind2web-results
9wzYQbTYsAIc
102 days ago
90% easy or 90% average?
theredsix
102 days ago
90% average with 85.51% hard!
9wzYQbTYsAIc
102 days ago
Nice! Will take a look at this for my homelab - was debating using crawl.cloudflare.com to try it out, as browser rendering was my next stretch goal.