|
|
|
|
|
by troysk
671 days ago
|
|
In my experience, this works well but doesn't scale to all kinds of documents.
For scientific papers; it can't render formulas. meta's nougat is the best model to do that.
For invoices and records; donut works better.
Both these models will fail in some cases so you end up running LLM to fix the issues.
Even with that LLM won't be able to do tables and charts justice, as the details were lost during OCR process (bold/italic/other nuances). I feel these might also be "classical" methods.
I have found vision models to be much better as they have the original document/image. Having prompts which are clear helps but still you won't get 100% results as they tend to venture off on their paths.
I believe that can be fixed using fine tuning but no good vision model provides fine tuning for images.
Google Gemini seems to have the feature but I haven't tried it.
Few shots prompting helps keep the LLM from hallucinating, prompt injection and helps adhering to the format requested. |
|
1. Segment document: Identify which part of the document is text, what is an image, what is a formula, what is a table, etc...
2. For text, do OCR + LLM. You can use LLMs to calculate the expectation of the predicted text, and if it is super off, try using ViT or something to OCR.
3. For tables, you can get a ViT/CNN to identify the cells to recover positional information, and then OCR + LLM for recovering the contents of cells
4. For formulas (and formulas in tables), just use a ViT/CNN.
5. For images, you can get a captioning ViT/CNN to caption the photo, if that's desired.