
by kethinov 28 days ago

Can someone explain what the current state of model benchmarking is? If you try to look up what the best locally runnable model is, you get a bunch of random blog posts using idiosyncratic criteria to rank things seemingly based on one dude's opinion.

Ideally I would love to see a leaderboard with relatively objective ranking criteria that 1. lets you filter by open weight / locally runnable, 2. filter by date of release (nothing older than x), and 3. is agnostic to hardware requirements. I just want to know what the best model is. Let me worry about how I will afford to run it.

I love the llmfit project for seeing what will run on your hardware, but it would be nice to know what I'm missing out on by not having better hardware, which is why objective hardware-agnostic ratings would be helpful.

5 comments

vessenes 28 days ago

That would be nice, but it's not going to be possible.

Any open benchmark has a very short life, since it will be pulled in and DPO / RL trained quickly for benchmaxxing purposes. So, you'll need a private test to have a hope of something fair. (These also get leaked over time, btw, so even then there's a window of usability).

These are expensive to run.

Now consider that there might be 15-20 viable quants for a given open model release; someone would have to want to pay for these private evals to be run on them. Even then, a good read through unsloth's commits and blog posts will remind you that there's quite a lot of engineering work to be done to get model inference working properly, even for models released by frontier or near-frontier labs. So, you'd want to make sure that you have a replicable 'best engineered' deployment to evaluate, or at least one that's closest to your hardware and fits the bill.

Upshot: it's much faster to download and try out a model yourself, and possibly cheaper too. Well, cheaper since Hugging Face is paying the bandwidth bills.


cyanydeez 28 days ago

there are benchmarks that have nothing to do with the training material, but with how the models are capable of things like reading code: https://needle-bench.cc/

Generally, you give them a document and you ask them to retrieve some subsection of the document then rate them on what they retrieved.

You can always find enough random documents, or create your own, to always run these and you can make it arbitrarily long. It's definitely a valid non-maxxable context test.
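
A minimal sketch of that kind of retrieval test, assuming a hypothetical `ask_model` callable that takes a prompt string and returns the model's reply (it is not part of any real library). The filler sentences, the "needle" sentence, and the secret code are all made up for illustration; the scoring is a plain substring check.

```python
import random

def build_haystack(filler_sentences, needle, n_sentences, seed=0):
    """Return (document, position) with the needle inserted at a random index."""
    rng = random.Random(seed)
    body = [rng.choice(filler_sentences) for _ in range(n_sentences)]
    position = rng.randrange(n_sentences + 1)
    body.insert(position, needle)
    return " ".join(body), position

def score_retrieval(answer, expected):
    """1.0 if the expected string appears in the answer (case-insensitive), else 0.0."""
    return 1.0 if expected.lower() in answer.lower() else 0.0

def run_needle_test(ask_model, filler_sentences, lengths, secret="7421-ALPHA"):
    """Run one retrieval question per document length; return {length: score}."""
    # Hypothetical needle and question, chosen only for this example.
    needle = f"The maintenance code for the north gate is {secret}."
    question = "What is the maintenance code for the north gate?"
    results = {}
    for n in lengths:
        doc, _ = build_haystack(filler_sentences, needle, n, seed=n)
        answer = ask_model(doc + "\n\n" + question)
        results[n] = score_retrieval(answer, secret)
    return results
```

Passing a growing list of `lengths` makes the document arbitrarily long, and changing `secret` or the filler text on each run keeps the test from matching anything a model could have memorized.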


jononor 27 days ago

This seems like a viable eval strategy. Presumably finding a bug requires some degree of understanding of the code, beyond just information retrieval. However it probably does not measure things like prompt adherence or ability to create code that implements a specification?


cyanydeez 27 days ago

you can extend the test pretty easily. run through several design turns and ask it for the same section again and again. that effectively measures usable context length.

ask it to modify lines 120-130 and add more context, etc.

we have rudimentary pre-LLM algorithms, like hamming distance and hashing, that can score the result.

you could even go all https://en.wikipedia.org/wiki/Jabberwocky to see if its sense of context is easily polluted.

the point though is there are benchmarks beyond pelican on a bike that can't be tokenmaxxed and that prove real value in capabilities
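
One way the "modify lines 120-130" check could be scored with hashing and Hamming distance, as a rough sketch. It assumes the edit keeps the line count unchanged, so the two files can be compared position by position; `check_edit` and its return keys are hypothetical names chosen for this example.

```python
import hashlib

def line_hashes(text):
    """SHA-256 hex digest of every line, in order."""
    return [hashlib.sha256(line.encode("utf-8")).hexdigest()
            for line in text.splitlines()]

def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def check_edit(original, edited, start, end):
    """Score an edit that should touch only lines start..end (1-indexed, inclusive).

    Assumption: the model was asked to keep the number of lines unchanged.
    """
    before, after = line_hashes(original), line_hashes(edited)
    if len(before) != len(after):
        return {"same_length": False, "changed_outside": None, "changed_inside": None}
    lo, hi = start - 1, end
    # Lines outside the requested range should be byte-for-byte identical.
    outside = hamming_distance(before[:lo] + before[hi:], after[:lo] + after[hi:])
    inside = hamming_distance(before[lo:hi], after[lo:hi])
    return {"same_length": True, "changed_outside": outside, "changed_inside": inside}
```

A nonzero `changed_outside` means the model disturbed lines it was told to leave alone, which is cheap to detect without any judgment call about code quality.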


sigmoid10 28 days ago

>I just want to know what the best model is. Let me worry about how I will afford to run it.

This is a very typical manager question that I suppose many people have who fail to see the simple truth: There is no "best" model. There are only best models for certain use-cases. Sometimes you'll find these in custom community leaderboards on platforms like huggingface, but for most business applications you'll probably have to come up with your own benchmark. Most common benchmarks are pretty worthless by now because all the usual ones are being gamed hard by model providers, to the point that there are now sometimes drastic differences between models that perform very similarly on common benchmarks.


sleepyeldrazi 28 days ago

The best thing I have come up with is to just make a bunch of prompts / tasks that I personally care about and need a model to know how to do. As an example, when qwen3.6 27B dropped, I ran it, kimi, claude and glm 5/5.1 on a bunch of LLM-architecture specific tasks (stuff like 'implement an incremental KV-cache for autoregressive transformer inference' or 'implement flash Attention backward pass with D-optimization') and analyzed the results: who wrote tests, are the tests valid, does their implementation actually work or do they only claim it does, that sort of thing.

It is a day's or a weekend's worth of work, but I think this is the best way to determine if the model fits your need specifically. This is what led me to finding out that qwen 27b outperformed even kimi on those tasks, and that opus tries gaslighting me when I give it a spec of something that has been proven, but no published solution exists online. All other models gave their best shot at solving it, opus just said it's not possible (even when I gave it the finished working product that obviously works).

Especially for small models (but also big ones) I think the only way to know if a model will improve your workflow is this: personal benchmarks, expanded over time, run in private.
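
A bare-bones sketch of what such a private benchmark harness might look like. Everything here is an assumption for illustration: each model is a hypothetical callable from prompt to output text, and each task carries its own `check` function, which in practice would ideally run the generated code rather than trust the model's claim that it works.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Task:
    name: str
    prompt: str
    check: Callable[[str], bool]  # True if the model output is acceptable

def run_benchmark(models: Dict[str, Callable[[str], str]],
                  tasks: List[Task]) -> Dict[str, float]:
    """Return the fraction of tasks each model passes."""
    scores = {}
    for model_name, generate in models.items():
        passed = 0
        for task in tasks:
            try:
                ok = task.check(generate(task.prompt))
            except Exception:
                # A crash in generation or checking counts as a failure.
                ok = False
            passed += bool(ok)
        scores[model_name] = passed / len(tasks) if tasks else 0.0
    return scores
```

Keeping the task list in a private file and appending to it over time gives the "expanded over time, run in private" property, since the prompts never end up in public training data.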


Alifatisk 27 days ago

> Ideally I would love to see a leaderboard with relatively objective ranking criteria that 1. lets you filter by open weight / locally runnable, 2. filter by date of release (nothing older than x), and 3. is agnostic to hardware requirements. I just want to know what the best model is. Let me worry about how I will afford to run it.

Stick to artificialanalysis.ai; it has become the norm.


lofaszvanitt 28 days ago

benchmarks = bs
