| HN Mirror



	by WarmWash 51 days ago
	LLMs can't really "see", so I challenge you to draw a pelican on a bike without any visual feedback, just code, because that is how they are doing it. Vision tokens for transformers aren't well solved yet, which is why they can smash a PhD math problem and trip over a "count the cats on the chair" problem.


raffael_de 50 days ago

It's not about seeing. It's about identifying the legs of the pelican and then transferring the concept and mechanics of riding a bicycle, plus the geometry of a body and a bicycle. The task also has nothing to do with vision tokens.


mountainriver 49 days ago

If we train a model extensively on SVGs, it will obviously be able to do this. We have only just started trying to do that.


WarmWash 50 days ago

> It's about identifying the legs

So, seeing?


raffael_de 50 days ago

Seeing isn't necessary to understand what a leg is.


WarmWash 49 days ago

Which is why humans can draw so well with their eyes closed?
