Hacker News
by vintermann 673 days ago
I agree that vision models that actually have access to the image are a sounder approach than running OCR and trying to fix up its output. They may be more expensive though, and depending on what you're trying to do, OCR plus fix-up may be good enough.

What I want to do is read handwritten documents from the 18th century, and I feel like the multistep approach hits a hard ceiling there. Transkribus is multistep, but the line detection model is just terrible. Things that should be easy, such as printed schemas, utterly confuse it. You simply need to be smart about context to a much higher degree than in OCR of typewritten text.

2 comments

I also think it's probably more effective. Every time, hand-crafted tools start out better than AI, but then the model gets bigger and AI wins. Think of image classification or language translation, which both went from hand-crafted pipelines to a full model.

In this case, the model can already do the OCR, and it is getting an order of magnitude cheaper every year.

Both OpenAI and Claude vision models are able to do that for me. It is more expensive than Tesseract, which can run on a CPU, but I assume it will become similarly cheap in the near future with open models and as AI becomes ubiquitous.