| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by XCSme 69 days ago

If it's relevant to the discussion, I hope not.

I've spent probably over100 hours working on this benchmarking/site platform, and all tests are manually written. For me (and many others that reached out to me) are not useless either. I use this myself regularly when choosing and comparing new models. I honestly beleive it is providing value to the conversation.

Let me know if you know of a better platform you can use to compare models, I built this one because I didn't find any with good enough UX.

1 comments

jaggs 69 days ago

It's a great benchmark. Don't listen to the haters. This one is especially interesting.

https://aibenchy.com/compare/anthropic-claude-sonnet-4-6-med...

link

BoorishBears 68 days ago

This one's even more interesting

https://aibenchy.com/compare/anthropic-claude-opus-4-6-mediu...

Who knew Anthropic was this far behind???

link

jaggs 68 days ago

Yeah, but actually that's not a good look. Anyone who's used Gemini will know how random it is in terms of getting anything serious done, compared to the rock solid opus experience.

link

BoorishBears 68 days ago

Their benchmark is chock-full of things like that: It's deeply flawed and is essentially rating how LLMs perform if you exert yourself trying to hold them entirely the wrong way.

link