HN Mirror



	by XCSme 3 hours ago
	I also tested it [0]: it is quite similar to GLM 5, scoring a few percent better while being 30% faster and 50% more expensive. [0]: https://aibenchy.com/?q=glm

3 comments

benxh 2 hours ago

This is a benchmark where Gemini Flash is better than Fable, btw.


XCSme 1 hour ago

Well, most people didn't like Fable when it was available anyway, because it refused to answer questions very often.


margalabargala 1 hour ago

And therefore it scores worse on benchmarks?


XCSme 53 minutes ago

Also, Claude/Fable models are quite bad at instruction following: https://artificialanalysis.ai/evaluations/ifbench


XCSme 55 minutes ago

On some it does, yes, and in real usage too.

It avoided answering 2 of the 21 tests in this specific benchmark, so its maximum possible score is already capped at about 90%.
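To make the arithmetic concrete, here is a minimal sketch of that ceiling. It assumes every test is weighted equally and that a refusal is scored as a failure; the thread does not confirm either, and `max_score` is a hypothetical helper, not part of the benchmark.

```python
def max_score(total_tests: int, refused: int) -> float:
    """Upper bound on the score if every refused test counts as a failure.

    Assumption: all tests carry equal weight.
    """
    return (total_tests - refused) / total_tests


# 2 refusals out of 21 tests leaves at most 19 passes.
ceiling = max_score(21, 2)
print(f"{ceiling:.1%}")  # prints 90.5%
```

So even with every answered test correct, the score cannot exceed 19/21.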


margalabargala 42 minutes ago

I'm glad those tests apparently work out for you, but a benchmark where three of the top 5 models are different flavors of Gemini Flash and zero are anything by Anthropic is just so wildly divergent from my personal experience with the models that it's not useful to me.

Whatever it is you're measuring, it's not anything related to what I use models for.


XCSme 36 minutes ago

Thanks for the feedback!

What are you using Claude models for? Coding only? Computer use? Which harness?


XCSme 3 hours ago

PS: I just added a cool feature: you can filter the leaderboard for multiple models at once by separating them with a comma, like: https://aibenchy.com/?q=glm,claude
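As a rough illustration of how a comma-separated `q` filter like this could work, here is a small sketch. How the site actually implements it is not stated; the case-insensitive substring matching and the helper names `parse_filter` and `matches` are assumptions made for this example.

```python
from urllib.parse import urlparse, parse_qs


def parse_filter(url: str) -> list[str]:
    """Split the `q` query parameter on commas into lowercase search terms."""
    query = parse_qs(urlparse(url).query)
    raw = query.get("q", [""])[0]
    return [term.strip().lower() for term in raw.split(",") if term.strip()]


def matches(model_name: str, terms: list[str]) -> bool:
    """Assumed rule: keep a model if any term is a substring of its name.

    An empty term list means no filter, so every model matches.
    """
    name = model_name.lower()
    return not terms or any(term in name for term in terms)


terms = parse_filter("https://aibenchy.com/?q=glm,claude")
print(terms)                           # ['glm', 'claude']
print(matches("GLM 5", terms))         # True
print(matches("Gemini Flash", terms))  # False
```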


lousken 3 hours ago

Still 1/4 of the price of Anthropic and OpenAI models, though.
