
by simonw 416 days ago

It might not be 100% clear from the writing but this benchmark is mainly intended as a joke - I built a talk around it because it's a great way to make the last six months of model releases a lot more entertaining.

I've been considering an expanded version of this where each model outputs ten images, then a vision model helps pick the "best" of those to represent that model in a further competition with other models.

(Then I would also expand the judging panel to three vision LLMs from different model families which vote on each round... partly because it will be interesting to track cases where the judges disagree.)

I'm not sure if it's worth me doing that though since the whole "benchmark" is pretty silly. I'm on the fence.
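A minimal sketch of the three-judge voting idea described above, under stated assumptions: the judges here are stand-in functions (`prefers_longer`, `prefers_shorter`, `prefers_first` are hypothetical placeholders, not real vision models), and `panel_vote` is an illustrative name rather than anything from the actual benchmark. It only shows how a majority vote and a disagreement flag could be recorded for one round.

```python
from collections import Counter
from typing import Callable, Sequence

# A judge takes two candidate images (here plain SVG strings) and returns "a" or "b".
Judge = Callable[[str, str], str]


def panel_vote(image_a: str, image_b: str, judges: Sequence[Judge]) -> tuple[str, bool]:
    """Return (winner, unanimous) for one head-to-head round.

    With an odd number of judges and two options there is always a majority.
    """
    votes = Counter(judge(image_a, image_b) for judge in judges)
    winner, top_count = votes.most_common(1)[0]
    return winner, top_count == len(judges)


# Hypothetical stand-in judges; real ones would each call a different vision model.
def prefers_longer(a: str, b: str) -> str:
    return "a" if len(a) >= len(b) else "b"


def prefers_shorter(a: str, b: str) -> str:
    return "a" if len(a) <= len(b) else "b"


def prefers_first(a: str, b: str) -> str:
    return "a"


winner, unanimous = panel_vote(
    "<svg>pelican</svg>",
    "<svg/>",
    [prefers_longer, prefers_shorter, prefers_first],
)
```

Logging the rounds where `unanimous` is false would give the list of cases where the judges disagree.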

5 comments

demosthanos 416 days ago

I'd say definitely do not do that. That would make the benchmark look more serious while still being problematic for knowledge cutoff reasons. Your prompt has become popular even outside your blog, so the odds of some SVG pelicans on bicycles making it into the training data have been going up and up.

Karpathy used it as an example in a recent interview: https://www.msn.com/en-in/health/other/ai-expert-asks-grok-3...

diggan 416 days ago

Yeah, this is the problem with benchmarks where the questions/problems are public. They're valuable for some months, until it bleeds into the training set. I'm certain a lot of the "improvements" we're seeing are just benchmarks leaking into the training set.

travisgriggs 416 days ago

That’s ok, once bicycle “riding” pelicans become normative, we can ask it for images of pelicans humping bicycles.

The number of subject-verb-object combinations is near infinite. All are imaginable, but most are not plausible. A plausibility machine (LLM) will struggle with the implausible, until it can abstract well.

zahlman 416 days ago

I can't fathom this working, simply because building a model that relates the word "ride" to "hump" seems like something that would be orders of magnitude easier for an LLM than visualizing the result of SVG rendering.

diggan 416 days ago

> The number of subject-verb-objects are near infinite. All are imaginable, but most are not plausible

Until there are enough unique/new subject-verb-object examples/benchmarks that the trained model actually generalizes, just like you did. (Public) benchmarks need to constantly evolve, otherwise they stop being useful.

demosthanos 416 days ago

To be fair, once it does generalize the pattern then the benchmark is actually measuring something useful for deciding if the model will be able to produce a subject-verb-object SVG.

throwaway31131 416 days ago

I’d say it doesn’t really matter. There is no universally good benchmark and really they should only be used to answer very specific questions which may or may not be relevant to you.

Also, as the old saying goes, the only thing worse than using benchmarks is not using benchmarks.

6LLvveMx2koXfwn 416 days ago

I would definitely say he had no intention of doing that and was doubling down on the original joke.

colecut 416 days ago

The road to hell is paved with the best intentions

clarification: I enjoyed the pelican on a bike and don't think it's that bad =p

telotortium 416 days ago

Yeah, Simon needs to release a new benchmark under a pen name, like Stephen King did with Richard Bachman.

Breza 411 days ago

Richard Bachman, you say? https://chatgpt.com/share/684c3f20-575c-800a-9ea2-889dd3deaf...

fzzzy 416 days ago

Even if it is a joke, having a consistent methodology is useful. I did it for about a year with my own private benchmark of reasoning type questions that I always applied to each new open model that came out. Run it once and you get a random sample of performance. Got unlucky, or got lucky? So what. That's the experimental protocol. Running things a bunch of times and cherry picking the best ones adds human bias, and complicates the steps.

simonw 416 days ago

It wasn't until I put these slides together that I realized quite how well my joke benchmark correlates with actual model performance - the "better" models genuinely do appear to draw better pelicans and I don't really understand why!

pama 416 days ago

How did the pelicans of point releases of V3 and of R1 (R1-0528) do compared to the original versions of the models?

famouswaffles 416 days ago

LLMs also have a 'g factor' https://www.sciencedirect.com/science/article/pii/S016028962...

MichaelZuo 416 days ago

I imagine the straightforward reason is that the “better” models are in fact significantly smarter in some tangible way, somehow.

johnrob 416 days ago

Well, the most likely single random sample would be a “representative” one :)

tuananh 416 days ago

until they start targeting this benchmark

simonw 416 days ago

Right, that was the closing joke for the talk.

jonstewart 416 days ago

It is funny to think that a hundred years in the future there may be some vestigial area of the models’ networks that’s still tuned to drawing pelicans on bicycles.

more-nitor 416 days ago

I just don't get the fuss from the pro-LLM people who don't want anyone to shame their LLMs...

people expect LLMs to say "correct" stuff on the first attempt, not 10000 attempts.

Yet, these people are perfectly OK with cherry-picked success stories on youtube + advertisements, while being extremely vehement about this simple experiment...

...well maybe these people rode the LLM hype-train too early, and are desperate to defend LLMs lest their investment go poof?

obligatory hype-graph classic: https://upload.wikimedia.org/wikipedia/commons/thumb/9/94/Ga...

Breza 411 days ago

Another advantage is you can easily include deprecated models in your comparisons. I maintain our internal LLM rankings at work. Since the prompts have remained the same, I can do things like compare the latest Gemini Pro to the original Bard.

Breza 411 days ago

I'd be really interested in evaluating the evaluations of different models. At work, I maintain our internal LLM benchmarks for content generation. We've always used human raters from MTurk, and the Elo rankings generally match what you'd expect. I'm looking at our options for having LLMs do the evaluating.

In your case, it would be neat to have a bunch of different models (and maybe MTurk) pick the winners of each head-to-head matchup and then compare how stable the Elo scores are between evaluators.
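As a rough illustration of comparing Elo stability between evaluators, here is a sketch using the standard Elo update formula. The matchup lists, model names, K-factor of 32 and starting rating of 1000 are all made-up assumptions for the example, not data or settings from any real benchmark.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_ratings(matchups, k: float = 32.0, start: float = 1000.0) -> dict:
    """Compute ratings from an ordered list of (winner, loser) pairs."""
    ratings = {}
    for winner, loser in matchups:
        rating_w = ratings.setdefault(winner, start)
        rating_l = ratings.setdefault(loser, start)
        # The winner gains exactly what the loser gives up.
        delta = k * (1.0 - expected_score(rating_w, rating_l))
        ratings[winner] = rating_w + delta
        ratings[loser] = rating_l - delta
    return ratings


def ranking(ratings: dict) -> list:
    """Model names sorted from highest to lowest rating."""
    return sorted(ratings, key=ratings.get, reverse=True)


# Hypothetical head-to-head results from two different evaluators.
human_results = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
llm_results = [("model_a", "model_b"), ("model_c", "model_a"), ("model_b", "model_c")]

human_ranking = ranking(elo_ratings(human_results))
llm_ranking = ranking(elo_ratings(llm_results))
same_order = human_ranking == llm_ranking
```

Sequential Elo depends on the order of the matchups, so with few rounds it may be worth shuffling the list several times and averaging before comparing evaluators.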

dilap 416 days ago

Joke or not, it still correlates much better with my own subjective experiences of the models than LM Arena!

ontouchstart 416 days ago

Very nice talk, accessible to the general public and to AI agents as well.

Any concerns that open source “AI celebrity talks” like yours could be used in contexts that would allow LLMs to optimize their market share in ways that we can’t imagine yet?

Your talk might influence the funding of AI startups.

#butterflyEffect

threecheese 416 days ago

I welcome a VC funded pelican … anything! Clippy 2.0 maybe?

Simon, hope you are comfortable in your new role of AI Celebrity.