by Workaccount2 169 days ago
LLMs are bad at anything with images.

There is something fucky about tokenizing images that just isn't as clean as tokenizing text. It's clear that the problem isn't the model being too dumb, but rather that the model isn't able to actually "see" the image presented. It feels like a lower-performance model looks at the image and then writes a text description of it for the "solver" model to work with.

To put it another way, the models can solve very high-level text-based problems while struggling with even low-level image problems, even when both rely on a similar or even identical solving framework underneath. If you have a choice between showing a model a graph or feeding it a list of (x,y) coordinates, go with the coordinates every time.
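
For what it's worth, here is a rough sketch of what "go with the coordinates" could look like in practice: dump the points as plain text and paste that into the prompt instead of attaching a rendered chart. The function name `points_to_prompt`, the `x,y` header, and the `precision` parameter are all made up for illustration, and no particular model or API is assumed.

```python
def points_to_prompt(points, precision=3):
    """Serialize (x, y) pairs as plain text, one pair per line.

    Hypothetical helper: the CSV-like layout is just one reasonable
    choice, not a format any specific model requires.
    """
    lines = ["x,y"]  # header row so the columns are unambiguous
    for x, y in points:
        # Round to a few significant digits to keep the text short.
        lines.append(f"{x:.{precision}g},{y:.{precision}g}")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [(0, 0.0), (1, 0.84147), (2, 0.9093)]
    print(points_to_prompt(sample))
```

The output is a short block of text you can drop straight into the message, so the model works from the numbers themselves rather than from whatever its image pipeline makes of the plot.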