HN Mirror


	by diggan 376 days ago
	Yeah, this is the problem with benchmarks where the questions/problems are public. They're valuable for some months, until they bleed into the training set. I'm certain a lot of the "improvements" we're seeing are just benchmarks leaking into the training set.


travisgriggs 376 days ago

That’s ok, once bicycle “riding” pelicans become normative, we can ask it for images of pelicans humping bicycles.

The number of subject-verb-object combinations is near infinite. All are imaginable, but most are not plausible. A plausibility machine (LLM) will struggle with the implausible, until it can abstract well.


zahlman 376 days ago

I can't fathom this working, simply because building a model that relates the word "ride" to "hump" seems like something that would be orders of magnitude easier for an LLM than visualizing the result of SVG rendering.


diggan 376 days ago

> The number of subject-verb-object combinations is near infinite. All are imaginable, but most are not plausible

Until there are enough unique/new subject-verb-object examples/benchmarks that the trained model actually generalizes, just like you did. (Public) benchmarks need to constantly evolve, otherwise they stop being useful.
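
A rough sketch of what "constantly evolving" could look like in practice: draw prompts from the product of word lists and skip any that have already been published. The word lists, the prompt template, and the function names here are all made up for illustration, not taken from any real benchmark.

```python
import itertools
import random

# Hypothetical word lists, chosen only for illustration.
SUBJECTS = ["pelican", "otter", "giraffe"]
VERBS = ["riding", "juggling", "repairing"]
OBJECTS = ["bicycle", "piano", "lawnmower"]


def all_prompts():
    """Every subject-verb-object prompt the word lists can produce."""
    return [
        f"Generate an SVG of a {s} {v} a {o}"
        for s, v, o in itertools.product(SUBJECTS, VERBS, OBJECTS)
    ]


def fresh_prompts(seen, k, seed=None):
    """Sample up to k prompts that are not in the already-published set."""
    rng = random.Random(seed)
    unseen = [p for p in all_prompts() if p not in seen]
    return rng.sample(unseen, min(k, len(unseen)))


if __name__ == "__main__":
    published = {"Generate an SVG of a pelican riding a bicycle"}
    print(len(all_prompts()), "possible prompts")
    for prompt in fresh_prompts(published, k=3, seed=0):
        print(prompt)
```

The pool grows multiplicatively with each list (3 x 3 x 3 here), so even modest vocabularies give far more combinations than anyone is likely to publish, which is the point about the space being near infinite.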


demosthanos 376 days ago

To be fair, once it does generalize the pattern, the benchmark is actually measuring something useful for deciding whether the model will be able to produce a subject-verb-object SVG.
