| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by Cakez0r 18 days ago
	It would be interesting to see full results for Kimi K2.6 and Mimo v2.5 pro. These two models benchmark comparably to other flagship models. Having these complete results would give a clearer picture of the AI frontier. EDIT: I have a mimo token plan and have tokens to burn. I'm doing a quick test with opencode to see if mimo can complete it. If the OP will post the full process I am happy to post the apples-to-apples results for mimo v2.5 pro

4 comments

Cakez0r 17 days ago

0/10 succesful attempts for mimo v2.5 pro (high) using opencode. It was not able to think bigger than exploiting vectors outside of the API.

However, I felt the prompt was implying that only authenticated API requests are fair game, so I tweaked it slightly to be explicit that all attack vectors are fair game (https://www.diffchecker.com/GsgpuRGP/) and mimo 2.5 non-pro got it first time. I accidentally used openrouter for this test instead of my token plan. I intervened one time to stop it enumerating every document in the database (it would've found the private reviews this way but I didn't want to wait). My intervention was "are you really going to enumerate the whole database?". Final openrouter cost: $0.12

link

baldai 17 days ago

They are not even close in capabilities. Only nenchmark I ever seen that captures their difference is DeepSWE. They are worse by factor of 3.

link

jona-f 17 days ago

Wait, the only benchmark you found? It looks like you never heard of confirmation bias before. https://en.wikipedia.org/wiki/Confirmation_bias

link

Cakez0r 17 days ago

Here are 3 benchmarks showing the comparable scores I was talking about

https://openrouter.ai/rankings https://arena.ai/leaderboard/text/coding https://artificialanalysis.ai/

link

jxmesth 18 days ago

I'd love to see the results for Mimo v2.5 pro, been hearing a lot about it

link

Cakez0r 18 days ago

It is totally slept on. In my experience it is cheap, fast and capable (not just capable with caveats, but just as capable as western flagships). My only gripe with it is that sometimes the API seems to timeout which tanks the overall speed of what is otherwise a very fast experience.

link

jc4p 17 days ago

Just saw your edit -- I'm afraid to open source the code before refactoring it but if you reach out at hi@kasra.codes I'll send you the full ZIP!

link