by glerk 3 days ago
This looks really great, and more thoughtful than any benchmark I've seen so far!

I'm curious whether you're only interested in scoring frontier models, or whether you would also accept submissions from custom harnesses. I'm working on multi-model harnesses and would love to test them against your benchmark. Do you plan on releasing the tasks publicly?

> Do you plan on releasing the tasks publicly?

yep

yay! looking forward to it, and thanks!