| HN Mirror


	by Manabu-eo 127 days ago
	How likely is it that this problem is already in the training set by now?

4 comments

simonw 127 days ago

If anyone trains a model on https://simonwillison.net/tags/pelican-riding-a-bicycle/ they're going to get some VERY weird looking pelicans.

suddenlybananas 127 days ago

Why would they train on that? Why not just hire someone to make a few examples?

simonw 127 days ago

I look forward to them trying. I'll know when the pelican riding a bicycle is good but the ocelot riding a skateboard sucks.

suddenlybananas 127 days ago

But they could just train on an assortment of animals and vehicles. It's the kind of relatively narrow domain where NNs could reasonably interpolate.

simonw 127 days ago

The idea that an AI lab would pay a small army of human artists to create training data for $animal on $transport just to cheat on my stupid benchmark delights me.

suddenlybananas 127 days ago

When you're spending trillions on capex, paying a couple of people to make some doodles in SVGs would not be a big expense.

dontwannahearit 127 days ago

Would it not be better to have 100 such tests ("Pelican on bicycle", "Tiger on stilts", ...), generate them all for every new model, but only release a new one each time? That way you could show progression across all models, and attempts at benchmaxxing would be more obvious.

Given the crazy money and the vying for supremacy among AI companies right now, it does seem naive to believe that no attempt at better pelicans on bicycles is being made. You can argue "but I will know because of the quality of ocelots on skateboards", but without a back catalog of ocelots on skateboards to publish, it's one datapoint and leaves the AI companies with too much plausible deniability.

The pelicans-on-bicycles test is a bit of fun for you (and us!), but it has become a measure of the quality of models, so it's serious business for them.

There is an asymmetry of incentives and a high risk that you are being their useful idiot. Sorry to be blunt.

Applejinx 127 days ago

Or indeed do the Markov chain conceptual slip. Pelican on bicycle, badger on stool, tiger on acid. Pelican on bicycle is definitely cooked, though: people know it and it's talked about in language.

throwup238 127 days ago

For every combination of animal and vehicle? Very unlikely.

The beauty of this benchmark is that it takes all of two seconds to come up with your own unique one. A seahorse on a unicycle. A platypus flying a glider. A man’o’war piloting a Portuguese man of war. Whatever you want.

recursive 127 days ago

No, not every combination. The question is about the specific combination of a pelican on a bicycle. It might be easy to come up with another test, but we're looking at the results from a particular one here.

svara 127 days ago

More likely, you would just train the model to emit SVG for a description of a scene, and create the training data from raster images.

recursive 127 days ago

None of this works if the testers are collaborating with the trainers. The tests ostensibly need to be at arm's length from the training. If the trainers ever start over-fitting to the test, the tester would secretly come up with some new test.

ebonnafoux 127 days ago

You can easily make an RLAIF loop.

- Take a list of n animals * m vehicles

- Ask an LLM to generate an SVG for each of these n*m options

- Generate a PNG from each SVG

- Ask a model with vision to grade the result

- Update your weights accordingly

No need for humans to draw the dataset, and no need for humans to evaluate.
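A minimal sketch of the data-collection half of that loop, to make the steps concrete. Everything here is hypothetical: `build_prompts`, `collect_rewards`, and the three callables (`generate_svg`, `rasterize`, `grade`) are stand-ins for whatever LLM, SVG renderer, and vision model you would actually plug in, and the prompt wording is made up. The weight update itself is left out because it depends entirely on the training framework.

```python
from itertools import product
from typing import Callable, Iterable, List, NamedTuple


class Sample(NamedTuple):
    prompt: str
    svg: str
    reward: float


def build_prompts(animals: Iterable[str], vehicles: Iterable[str]) -> List[str]:
    """Enumerate every animal/vehicle pair as a text prompt (n * m prompts)."""
    return [
        f"Generate an SVG of a {animal} riding a {vehicle}"
        for animal, vehicle in product(animals, vehicles)
    ]


def collect_rewards(
    prompts: Iterable[str],
    generate_svg: Callable[[str], str],    # hypothetical: LLM call, prompt -> SVG text
    rasterize: Callable[[str], bytes],     # hypothetical: SVG text -> PNG bytes
    grade: Callable[[str, bytes], float],  # hypothetical: vision model, (prompt, PNG) -> score
) -> List[Sample]:
    """Run generate -> rasterize -> grade once per prompt and keep the scores."""
    samples = []
    for prompt in prompts:
        svg = generate_svg(prompt)
        png = rasterize(svg)
        reward = grade(prompt, png)
        samples.append(Sample(prompt, svg, reward))
    return samples
```

The resulting `(prompt, svg, reward)` tuples are what a policy-gradient or preference-based trainer would consume in the final "update your weights" step.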

verdverm 127 days ago

I've heard it posited that the reason the frontier companies are frontier is that they have custom data and evals. This is what I would do too.

zarzavat 127 days ago

You can always ask for a tyrannosaurus driving a tank.
