| HN Mirror


	by sarreph 5 days ago
	I'm beginning to wonder how useful a metric the pelican is, since surely the frontier labs must be training their models on pelican artistry given how well known your test is now?

7 comments

bensyverson 5 days ago

Simon has addressed this on virtually every new model release. He also has unpublished alternate prompts. But the larger point is: this is a fun experiment, not a serious and objective benchmark.

refulgentis 5 days ago

It's silly and a joke and a surprisingly good benchmark and don't take it seriously but don't take not taking it seriously seriously and if it's too good we use another prompt but don't actually because then it's not the pelican post and there's obvious ways to better it and it's not worth doing because it's not serious.

Only coherent move at this point: hit the minus button immediately. There's never anything about the model in the thread other than simon's post.

stasomatic 5 days ago

But what if they are better at flamingos? Are they optimized for pelicans? How about “draw me a four headed owl”? The meme, I get it, but I’d settle for a working bash script, tbh.

wongarsu 5 days ago

I just run my own benchmark for "draw an SVG with $animal driving $vehicle". I won't post my choice of animal and mode of transport, but there are plenty of uncommon combinations to choose from. So far it's a fun and visually intuitive benchmark that does seem to correlate with model capabilities
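
A minimal sketch of how a private prompt set like this might be generated, assuming Python. The animal and vehicle lists are hypothetical placeholders (the actual choices are deliberately not posted), and `build_prompts` is an illustrative helper, not anyone's real tooling:

```python
import itertools
import random

# Hypothetical placeholder lists; swap in your own uncommon choices
# and keep them unpublished so they stay out of training data.
ANIMALS = ["pelican", "flamingo", "walrus", "heron"]
VEHICLES = ["bicycle", "unicycle", "kayak", "tractor"]


def build_prompts(animals, vehicles, k, seed=0):
    """Pick k distinct animal/vehicle pairs and format them as prompts."""
    combos = list(itertools.product(animals, vehicles))
    rng = random.Random(seed)  # fixed seed so every model sees the same prompts
    chosen = rng.sample(combos, k)
    return [f"Draw an SVG with a {a} driving a {v}" for a, v in chosen]


if __name__ == "__main__":
    for prompt in build_prompts(ANIMALS, VEHICLES, 3):
        print(prompt)
```

Fixing the seed keeps the prompt set identical across model releases, so the rendered SVGs can be compared side by side.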

modriano 5 days ago

I don't know. Just looking at the bike frames (specifically the fact that the AI generated bikes have rather unsteerable front forks), it's clear to me that frontier labs aren't spending much time tuning models to make bikes look coherent, which I assume is an easier task than making a pelican riding a bike look coherent.

HaZeust 5 days ago

I've seen this reply to Simon's benchmark for 2 years running now, and yet you still see improvements and objectively-bad results over time from new releases, even when I'm sure every frontier AI team has/had a person at least partially dedicated to better bicycle-pelican SVG outputs. Alas.

sarreph 5 days ago

I had intended to caveat that: I'm sure I'm not the first person to ask about this!

> you still see improvements

This is expected if they are training their models on it, right?

> objectively-bad results

Keen to learn when this has been the case, i.e. across version increments in major models.

simonw 5 days ago

I've written about this a couple of times, most notably here: https://simonwillison.net/2025/Nov/13/training-for-pelicans-...

I've been enjoying seeing how the quality of individual models differs based on the amount of reasoning effort you give them. If they were baking in a good pelican you wouldn't expect them to differ so much.

(Google Gemini are the only lab that have very clearly paid attention to the quality of SVG animals-riding-vehicles, see their announcement for Gemini 3.1: https://twitter.com/JeffDean/status/2024525132266688757 )

sarreph 5 days ago

Amazing, thank you Simon! Look forward to reading.

mrandish 4 days ago

Hence it has become a meta-benchmark of relative progress in SVG image generation of a known target which has leaked into the training data and for which "every frontier AI team has/had a person at least partially dedicated to" at least checking if not optimizing.

llm_nerd 5 days ago

I honestly assumed their comment was tongue in cheek humour, because positively no one actually cares how these models generate an SVG pelican riding a bicycle. It's some meme thing that this stuff always appears here.

BrokenCogs 5 days ago

Yeah, this is not a real benchmark, it's just a fun tradition every time a new model is released

pelipost123 5 days ago

"fun" / boringly predictable meme thread with 30+ replies already

brazukadev 5 days ago

It is telling that people need to create throwaway accounts to criticize simonw's behavior on this website.

mrandish 4 days ago

It's evolved from a funny, unserious benchmark to a tradition. When a major new model is released, I now always check the HN thread for Simon's Pelican post. I'll be sad when I don't find it.

When it started, comparing the progress between models was mildly interesting but everyone (including Simon) acknowledges it certainly leaked into the training data long ago.

notnullorvoid 5 days ago

The way I see it, the benefit of the benchmark isn't to take Simon's results at face value. It's a template for your own benchmarks that are easy to visually evaluate.

iLoveOncall 5 days ago

It was a completely useless test even before the labs trained for it.

mrandish 4 days ago

Yes, it's always been published as a joke. You've explained why it was (and still is) funny meta-commentary on AI benchmarks.
