Hacker News
by pilooch 488 days ago
The question is: what is OCR for? If it's to answer questions and work with a document, then VLMs do actually contain self-correcting mechanisms. That is, the end-to-end mapping from image + text input to text output is statistically grounded by training. So the question to ask is: what do you need OCR for? Feeding an LLM? Then feed the document to the VLM instead. Some other usage? Well, to be decided. But as of now, CTC and LSTMs are done with, because VLMs do everything: finding the area to read, reading, embedding, and answering. OCR was an intermediate step, and it's going away.