

	by glerk 50 days ago
	This looks really great, more thoughtful than any benchmark that I've seen until now! I'm curious whether you're only interested in scoring frontier models or would also accept submissions from custom harnesses. I am working on multi-model harnesses and would love to test them against your benchmark. Do you plan on releasing the tasks publicly?


swyx 50 days ago

> Do you plan on releasing the tasks publicly?

yep

glerk 50 days ago

yay! looking forward, and thanks!