This is really pretty cool. LLM's are so bad at images, it just makes sense to use reasoning to improve them. I'd love to see this applied to a bigger model than 3B, because this task is not difficult. But the attention visualization really demonstrates that it's doing what it's supposed to.
What do you mean LLMs are bad at images? GPT or Claude can read text perfectly, and describe what's in a picture in a lot of detail. I feel like replacing OCR is one of the few things you can actually trust them for.
That's true - they are quite good at OCR. But they're really bad at a bunch of tasks that seem like they should be super simple. Like "are these lines crossed" or "which letter is circled". See https://vlmsareblind.github.io/ for some clear examples.
That’s a good observation. For this project, I found that while the base model could “read” the image, it didn’t really understand how to use it. GRPO allowed it to effectively search the solution space.