by vintermann 640 days ago
I'm not a start-up, but I want to use Llama and related models to transcribe historical handwritten documents and, if possible, to extract structured data from them that isn't directly visible in a word-for-word transcription (many of the documents are forms).

I've tried many different models, but vision models are overwhelmingly oriented towards pictures rather than writing, and the results aren't good.

1 comment

I was pleasantly surprised by the "OCR" results MiniCPM-V 2.6 gives on any kind of text, including handwriting, given an image and a trivial prompt. I'll be sure to keep an eye on this family of models.

It's no replacement for OCR of printed text, of course, because it sometimes generates random text, but it looked very useful for handwritten text and all kinds of decorative fonts (e.g. "inspirational posters"). I imagine this could work:

  * if you're going to check the output manually, or

  * if you somehow make it part of a pipeline where this model recognizes the rough layout of the page and, to get reliable text, you cut the page into blocks and run traditional OCR on each block, or

  * if you somehow diff the VLM output against the OCR tool's output,

although keep in mind that MiniCPM-V can't identify pixel positions in the image the way Gemini Pro does here: https://simonwillison.net/2024/Aug/26/gemini-bounding-box-vi...
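
For the diffing idea, here is a rough sketch of what I have in mind, not something I've run on real documents. It assumes you already have both transcriptions as plain strings, and it only uses Python's standard difflib. The function names `disagreements` and `agreement_ratio` are made up for illustration.

```python
import difflib


def disagreements(vlm_text: str, ocr_text: str) -> list[tuple[str, str, str]]:
    """Return word-level spans where the two transcriptions differ.

    Each entry is (kind, vlm_words, ocr_words), where kind is one of
    difflib's opcode tags: "replace", "delete" or "insert".
    """
    vlm_words = vlm_text.split()
    ocr_words = ocr_text.split()
    # autojunk=False: don't let difflib treat frequent words as junk.
    matcher = difflib.SequenceMatcher(a=vlm_words, b=ocr_words, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            spans.append(
                (tag, " ".join(vlm_words[i1:i2]), " ".join(ocr_words[j1:j2]))
            )
    return spans


def agreement_ratio(vlm_text: str, ocr_text: str) -> float:
    """Similarity in [0, 1] over words; 1.0 means identical word sequences."""
    matcher = difflib.SequenceMatcher(
        a=vlm_text.split(), b=ocr_text.split(), autojunk=False
    )
    return matcher.ratio()


if __name__ == "__main__":
    # Hypothetical transcriptions of the same line from the two tools.
    vlm = "born 12 May 1843 in Bergen"
    ocr = "born 12 Mav 1843 in Bergen"
    for kind, from_vlm, from_ocr in disagreements(vlm, ocr):
        print(f"{kind}: VLM={from_vlm!r} OCR={from_ocr!r}")
    print(f"agreement: {agreement_ratio(vlm, ocr):.2f}")
```

Spans where the two tools disagree are the places a human would want to look at first. A low agreement ratio over a whole page might hint that one of the two outputs went off the rails, though what threshold counts as "low" would have to be found by trial.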