| HN Mirror


by nathan_phoenix 416 days ago

My biggest gripe is that he's comparing probabilistic models (LLMs) by a single sample.

You wouldn't compare different random number generators by taking one sample from each and then concluding that generator 5 generates the highest numbers...

Would be nicer to run the comparison with 10 images (or more) for each LLM and then average.
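A minimal sketch of why averaging helps, assuming each model has a fixed "true quality" and each generated image scores that quality plus Gaussian noise. The quality values, the noise level and every function name here are made up for illustration; nothing is taken from the actual benchmark.

```python
import random
from statistics import mean


def sample_score(true_quality, rng, noise=1.0):
    """One noisy score for one generated image (hypothetical noise model)."""
    return rng.gauss(true_quality, noise)


def pick_winner(qualities, rng, samples_per_model):
    """Index of the model with the highest mean score over n samples each."""
    means = [
        mean(sample_score(q, rng) for _ in range(samples_per_model))
        for q in qualities
    ]
    return max(range(len(qualities)), key=means.__getitem__)


def accuracy(qualities, samples_per_model, trials=2000, seed=0):
    """Fraction of simulated comparisons that crown the truly best model."""
    rng = random.Random(seed)
    best = max(range(len(qualities)), key=qualities.__getitem__)
    hits = sum(
        pick_winner(qualities, rng, samples_per_model) == best
        for _ in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    qualities = [5.0, 5.5, 6.0]  # assumed true qualities of three models
    print("1 sample each:  ", accuracy(qualities, 1))
    print("10 samples each:", accuracy(qualities, 10))
```

Under these assumptions the ten-sample comparison picks the truly best model noticeably more often than the single-sample one, which is the point of the random number generator analogy.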

6 comments

simonw 416 days ago

It might not be 100% clear from the writing but this benchmark is mainly intended as a joke - I built a talk around it because it's a great way to make the last six months of model releases a lot more entertaining.

I've been considering an expanded version of this where each model outputs ten images, then a vision model helps pick the "best" of those to represent that model in a further competition with other models.

(Then I would also expand the judging panel to three vision LLMs from different model families which vote on each round... partly because it will be interesting to track cases where the judges disagree.)

I'm not sure if it's worth me doing that though since the whole "benchmark" is pretty silly. I'm on the fence.
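The three-judge panel described above could be sketched roughly like this. It assumes each judge returns a vote of "left" or "right" for a head-to-head round; the judge names are placeholders and the actual calls to vision models are left out entirely.

```python
from collections import Counter

# Hypothetical judge names; any three vision models from different
# families would fill these slots.
JUDGES = ("judge_a", "judge_b", "judge_c")


def tally(votes):
    """votes maps judge name -> "left" or "right".

    Returns (winner, unanimous). With three judges and two options
    a tie is impossible, so the majority is always well defined.
    """
    counts = Counter(votes.values())
    winner, top = counts.most_common(1)[0]
    return winner, top == len(votes)


def run_rounds(rounds):
    """Return the winner of each round plus the indexes of split decisions."""
    winners, split = [], []
    for i, votes in enumerate(rounds):
        winner, unanimous = tally(votes)
        winners.append(winner)
        if not unanimous:
            split.append(i)  # rounds where the judges disagreed
    return winners, split


if __name__ == "__main__":
    rounds = [
        {"judge_a": "left", "judge_b": "left", "judge_c": "left"},
        {"judge_a": "right", "judge_b": "left", "judge_c": "right"},
    ]
    print(run_rounds(rounds))
```

The list of split rounds is the part that covers "cases where the judges disagree".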

demosthanos 416 days ago

I'd say definitely do not do that. That would make the benchmark look more serious while still being problematic for knowledge cutoff reasons. Your prompt has become popular even outside your blog, so the odds of some SVG pelicans on bicycles making it into the training data have been going up and up.

Karpathy used it as an example in a recent interview: https://www.msn.com/en-in/health/other/ai-expert-asks-grok-3...

diggan 416 days ago

Yeah, this is the problem with benchmarks where the questions/problems are public. They're valuable for some months, until it bleeds into the training set. I'm certain a lot of the "improvements" we're seeing are just benchmarks leaking into the training set.

travisgriggs 416 days ago

That’s ok, once bicycle “riding” pelicans become normative, we can ask it for images of pelicans humping bicycles.

The number of subject-verb-objects is near infinite. All are imaginable, but most are not plausible. A plausibility machine (LLM) will struggle with the implausible, until it can abstract well.

zahlman 416 days ago

I can't fathom this working, simply because building a model that relates the word "ride" to "hump" seems like something that would be orders of magnitude easier for an LLM than visualizing the result of SVG rendering.

diggan 416 days ago

> The number of subject-verb-objects is near infinite. All are imaginable, but most are not plausible

Until there are enough unique/new subject-verb-object examples/benchmarks that the trained model actually generalizes the pattern, just like you did. (Public) benchmarks need to constantly evolve, otherwise they stop being useful.

demosthanos 416 days ago

To be fair, once it does generalize the pattern then the benchmark is actually measuring something useful for deciding if the model will be able to produce a subject-verb-object SVG.

throwaway31131 416 days ago

I’d say it doesn’t really matter. There is no universally good benchmark and really they should only be used to answer very specific questions which may or may not be relevant to you.

Also, as the old saying goes, the only thing worse than using benchmarks is not using benchmarks.

6LLvveMx2koXfwn 416 days ago

I would definitely say he had no intention of doing that and was doubling down on the original joke.

colecut 416 days ago

The road to hell is paved with the best intentions

clarification: I enjoyed the pelican on a bike and don't think it's that bad =p

telotortium 415 days ago

Yeah, Simon needs to release a new benchmark under a pen name, like Stephen King did with Richard Bachman.

Breza 411 days ago

Richard Bachman, you say? https://chatgpt.com/share/684c3f20-575c-800a-9ea2-889dd3deaf...

fzzzy 416 days ago

Even if it is a joke, having a consistent methodology is useful. I did it for about a year with my own private benchmark of reasoning type questions that I always applied to each new open model that came out. Run it once and you get a random sample of performance. Got unlucky, or got lucky? So what. That's the experimental protocol. Running things a bunch of times and cherry picking the best ones adds human bias, and complicates the steps.

simonw 416 days ago

It wasn't until I put these slides together that I realized quite how well my joke benchmark correlates with actual model performance - the "better" models genuinely do appear to draw better pelicans and I don't really understand why!

pama 416 days ago

How did the pelicans of point releases of V3 and of R1 (R1-0528) do compared to the original versions of the models?

famouswaffles 416 days ago

LLMs also have a 'g factor' https://www.sciencedirect.com/science/article/pii/S016028962...

MichaelZuo 416 days ago

I imagine the straightforward reason is that the “better” models are in fact significantly smarter in some tangible way, somehow.

johnrob 416 days ago

Well, the most likely single random sample would be a “representative” one :)

tuananh 416 days ago

until they start targeting this benchmark

simonw 416 days ago

Right, that was the closing joke for the talk.

jonstewart 416 days ago

It is funny to think that a hundred years in the future there may be some vestigial area of the models’ networks that’s still tuned to drawing pelicans on bicycles.

more-nitor 416 days ago

I just don't get the fuss from the pro-LLM people who don't want anyone to shame their LLMs...

people expect LLMs to say "correct" stuff on the first attempt, not 10000 attempts.

Yet, these people are perfectly OK with cherry-picked success stories on youtube + advertisements, while being extremely vehement about this simple experiment...

...well maybe these people rode the LLM hype-train too early, and are desperate to defend LLMs lest their investment go poof?

obligatory hype-graph classic: https://upload.wikimedia.org/wikipedia/commons/thumb/9/94/Ga...

Breza 411 days ago

Another advantage is you can easily include deprecated models in your comparisons. I maintain our internal LLM rankings at work. Since the prompts have remained the same, I can do things like compare the latest Gemini Pro to the original Bard.

Breza 411 days ago

I'd be really interested in evaluating the evaluations of different models. At work, I maintain our internal LLM benchmarks for content generation. We've always used human raters from MTurk, and the Elo rankings generally match what you'd expect. I'm looking at our options for having LLMs do the evaluating.

In your case, it would be neat to have a bunch of different models (and maybe MTurk) pick the winners of each head-to-head matchup and then compare how stable the Elo scores are between evaluators.
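A rough sketch of that comparison, assuming the standard Elo update with a K-factor of 32 and a starting rating of 1000 (both are assumptions, not values from this thread). The model names and matchup lists are hypothetical; each evaluator is represented only by the list of (winner, loser) pairs it produced.

```python
from itertools import combinations


def expected_score(ra, rb):
    """Standard Elo expected score of a player rated ra against one rated rb."""
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))


def elo_ratings(matchups, k=32.0, start=1000.0):
    """Compute ratings from an ordered list of (winner, loser) pairs."""
    ratings = {}
    for winner, loser in matchups:
        rw = ratings.setdefault(winner, start)
        rl = ratings.setdefault(loser, start)
        gain = k * (1.0 - expected_score(rw, rl))
        ratings[winner] = rw + gain  # winner gains what the loser drops
        ratings[loser] = rl - gain
    return ratings


def pairwise_agreement(r1, r2):
    """Fraction of model pairs that two evaluators order the same way."""
    models = sorted(set(r1) & set(r2))
    pairs = list(combinations(models, 2))
    if not pairs:
        return 1.0
    same = sum((r1[a] > r1[b]) == (r2[a] > r2[b]) for a, b in pairs)
    return same / len(pairs)


if __name__ == "__main__":
    # Hypothetical verdicts from two evaluators on the same three models.
    human = [("m1", "m2"), ("m1", "m3"), ("m2", "m3")]
    llm = [("m1", "m2"), ("m3", "m1"), ("m2", "m3")]
    print(pairwise_agreement(elo_ratings(human), elo_ratings(llm)))
```

Sequential Elo depends on the order of the matchups, so a stricter stability check would also shuffle the matchup order and see how much the ratings move.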

dilap 416 days ago

Joke or not, it still correlates much better with my own subjective experiences of the models than LM Arena!

ontouchstart 416 days ago

Very nice talk, acceptable to the general public and to AI agents as well.

Any concerns that open source “AI celebrity talks” like yours could be used in contexts that would allow LLMs to optimize their market share in ways that we can’t imagine yet?

Your talk might influence the funding of AI startups.

#butterflyEffect

threecheese 416 days ago

I welcome a VC funded pelican … anything! Clippy 2.0 maybe?

Simon, hope you are comfortable in your new role of AI Celebrity.

planb 416 days ago

And by a sample that has become increasingly known as a benchmark. Newer training data will contain more articles like this one, which naturally improves the capabilities of an LLM to estimate what’s considered a good „pelican on a bike“.

criddell 416 days ago

And that’s why he says he’s going to have to find a new benchmark.

viraptor 416 days ago

Would it though? There really aren't that many valid answers to that question online. When this is talked about, we get more broken samples than reasonable ones. I feel like any talk about this actually sabotages future training a bit.

I actually don't think I've seen a single correct svg drawing for that prompt.

cyanydeez 416 days ago

So what you really need to do is clone this blog post, find and replace pelican with any other noun, run all the tests, and publish that.

Call it wikipediaslop.org

YuccaGloriosa 416 days ago

If the any other noun becomes fish... I think I disagree.

puttycat 416 days ago

You are right, but the companies making these models invest a lot of effort in marketing them as anything but probabilistic, i.e. making people think that these models work discretely like humans.

In that case we'd expect a human with perfect drawing skills and perfect knowledge about bikes and birds to output such a simple drawing correctly 100% of the time.

In any case, even if a model is probabilistic, if it had correctly learned the relevant knowledge you'd expect the output to be perfect because it would serve to lower the model's loss. These outputs clearly indicate flawed knowledge.

ben_w 416 days ago

> In that case we'd expect a human with perfect drawing skills and perfect knowledge about bikes and birds to output such a simple drawing correctly 100% of the time.

Look upon these works, ye mighty, and despair: https://www.gianlucagimini.it/portfolio-item/velocipedia/

jodrellblank 416 days ago

You claim those are drawn by people with "perfect knowledge about bikes" and "perfect drawing skills"?

ben_w 416 days ago

More that "these models work … like humans" (discretely or otherwise) does not imply the quotation.

Most humans do not have perfect drawing skills and perfect knowledge about bikes and birds, they do not output such a simple drawing correctly 100% of the time.

"Average human" is a much lower bar than most people want to believe, mainly because most of us are average on most skills, and also overestimate our own competence — the modal human has just a handful of things they're good at, and one of those is the language they use, another is their day job.

Most of us can't draw, and demonstrably can't remember (or figure out from first principles) how a bike works. But this also applies to "smart" subsets of the population: physicists have https://xkcd.com/793/, and there's this famous rocket scientist who weighed in on rescuing kids from a flooded cave and came up with some nonsense about a submarine.

Retric 416 days ago

It’s not that humans have perfect drawing skills, it’s that humans can judge their performance and get better over time.

Ask 100 random people to draw a bike in 10 minutes and they’ll on average suck while still beating the LLMs here. Give em an incentive and 10 months and the average person is going to be able to make at least one quite decent drawing of a bike.

The cost and speed advantage of LLMs is real as long as you’re fine with extremely low quality. Ask a model for 10,000 drawings so you can pick the best and you get a marginal improvement based on random chance at a steep price.

ben_w 416 days ago

> Ask 100 random people to draw a bike in 10 minutes and they’ll on average suck while still beating the LLMs here.

Y'see, this is a prime example of what I meant with ""Average human" is a much lower bar than most people want to believe, mainly because most of us are average on most skills, and also overestimate our own competence".

An expert artist can spend 10 minutes and end up with a brief sketch of a bike. You can witness this exact duration yourself (with non-bike examples) because of a challenge a few years back to draw the same picture in 10 minutes, 1 minute, and 10 seconds.

A normal person spending as much time as they like gets you the pictures that I linked to in the previous post, because they don't really know what a bike is. 45 examples of what normal people think a bike looks like: https://www.gianlucagimini.it/portfolio-item/velocipedia/

> Give em an incentive and 10 months and the average person is going to be able to make at least one quite decent drawing of a bike.

Given mandatory art lessons in school are longer than 10 months, and yet those bike examples exist, I have no reason to believe this.

> Ask a model for 10,000 drawings so you can pick the best and you get a marginal improvement based on random chance at a steep price.

If you do so as a human, rating and comparing images? Then the cost is your own time.

If you automate it in literally the manner in this write-up (pairwise comparison via API calls to another model to get Elo ratings), ten thousand images is like $60-$90, which is on the low end for a human commission.

rightbyte 416 days ago

That blog post is a 10/10. Oh dear I miss the old internet.

cyanydeez 416 days ago

Humans absolutely do not work discretely.

loloquwowndueo 416 days ago

They probably meant deterministically as opposed to probabilistically. Humans don't work like that either :)

aspenmayer 416 days ago

I thought they meant discreetly.

bufferoverflow 416 days ago

> work discretely like humans

What kind of humans are you surrounded by?

Ask any human to write 3 sentences about a specific topic. Then ask them the exact same question the next day. They will not write the same 3 sentences.

mooreds 416 days ago

My biggest gripe is that he outsourced evaluation of the pelicans to another LLM.

I get it was way easier to do and that doing it took pennies and no time. But I would have loved it if he'd tried alternate methods of judging and seen what the results were.

Other ways:

* wisdom of the crowds (have people vote on it)

* wisdom of the experts (send the pelican images to a few dozen artists or ornithologists)

* wisdom of the LLMs (use more than one LLM)

Would have been neat to see what the human consensus was and if it differed from the LLM consensus

Anyway, great talk!

zahlman 416 days ago

It would have been interesting to see if the LLM that Claude judged worst would have attempted to justify itself....

timewizard 416 days ago

My biggest gripe is he didn't include a picture of an actual pelican.

https://www.google.com/search?q=pelican&udm=2

The "closest pelican" is not even close.

qeternity 416 days ago

I think you mean non-deterministic, instead of probabilistic.

And there is no reason that these models need to be non-deterministic.

skybrian 416 days ago

A deterministic algorithm can still be unpredictable in a sense. In the extreme case, a procedural generator (like in Minecraft) is deterministic given a seed, but you will still have trouble predicting what you get if you change the seed, because internally it uses a (pseudo-)random number generator.

So there’s still the question of how controllable the LLM really is. If you change a prompt slightly, how unpredictable is the change? That can’t be tested with one prompt.
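A toy illustration of the seed point, using Python's standard `random.Random`. This is not how Minecraft actually generates terrain; `generate_heights` is a made-up stand-in for any procedural generator driven by a seeded pseudo-random number generator.

```python
import random


def generate_heights(seed, width=16):
    """Deterministic 'terrain': the same seed always yields the same heights."""
    rng = random.Random(seed)  # private generator, fully determined by the seed
    return [rng.randint(0, 255) for _ in range(width)]


if __name__ == "__main__":
    print(generate_heights(42) == generate_heights(42))  # same seed, same output
    print(generate_heights(42))
    print(generate_heights(43))  # neighbouring seed, unrelated-looking output
```

The output is fully reproducible for a given seed, yet knowing the result for seed 42 tells you essentially nothing about seed 43. The analogous test for an LLM would vary the prompt slightly and look at how much the output changes.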

rvz 416 days ago

> I think you mean non-deterministic, instead of probabilistic.

My thoughts too. It's more accurate to label LLMs as non-deterministic instead of "probabilistic".