HN Mirror


	by oofbey 511 days ago
	This is really pretty cool. LLMs are so bad at images that it just makes sense to use reasoning to improve them. I'd love to see this applied to a bigger model than 3B, because this task is not difficult. But the attention visualization really demonstrates that it's doing what it's supposed to.

2 comments

skumar17 511 days ago

Thanks! I really love the visualization too. We have a hosted demo you can try as well!

https://huggingface.co/spaces/Groundlight/grpo-vlm-decoder

oofbey 511 days ago

Fun! I wish the demo had the attention visualization. Would that be easy to add? Is the source code for the HF demo in the repo too?

skumar17 511 days ago

Unfortunately it might be a bit challenging as there’s a nontrivial amount of extra computation we do for the viz, but it’s probably possible?

skumar17 511 days ago

The attention demo code is in the /attention_demo directory if you want to try it on your own messages too :)

xoofoog 511 days ago

What do you mean LLMs are bad at images? GPT or Claude can read text perfectly, and describe what's in a picture in a lot of detail. I feel like replacing OCR is one of the few things you can actually trust them for.

oofbey 511 days ago

That's true - they are quite good at OCR. But they're really bad at a bunch of tasks that seem like they should be super simple. Like "are these lines crossed" or "which letter is circled". See https://vlmsareblind.github.io/ for some clear examples.

skumar17 511 days ago

That’s a good observation. For this project, I found that while the base model could “read” the image, it didn’t really understand how to use it. GRPO allowed it to effectively search the solution space.
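
For anyone unfamiliar with GRPO, here is a rough sketch of the group-relative advantage it is built around: sample several completions for the same prompt, score each one, and rank them against their own group rather than against a learned value function. The function name and the reward values are made up for illustration and are not from the project's code. Using the population standard deviation is also an assumption, since implementations differ on that detail.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar reward per completion sampled from the
    same prompt. Completions that beat the group average get a positive
    advantage, the rest get a negative one.
    """
    mean = statistics.fmean(rewards)
    # Population std is an assumption; some implementations use sample std.
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every completion scores the same.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards: only the third of four sampled completions was correct.
rewards = [0.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
```

The one correct completion ends up with a positive advantage and the others with negative ones, so even a rare success pulls the policy toward it. That is roughly the sense in which sampling a group lets the model search the solution space.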