|
Maybe a pipeline like: 1. Segment the document: identify which parts are text, images, formulas, tables, etc. 2. For text, do OCR + LLM. You can use the LLM to score how expected the predicted text is, and if it is way off, fall back to a ViT or similar for OCR. 3. For tables, use a ViT/CNN to identify the cells and recover positional information, then OCR + LLM to recover the cell contents. 4. For formulas (including formulas in tables), just use a ViT/CNN. 5. For images, use a captioning ViT/CNN to caption them, if that's desired. |
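Roughly, the routing in the steps above could be sketched like this. It is only a skeleton under assumptions: `Region`, `plausibility`, `read_text`, and `process` are hypothetical names, the OCR models are passed in as plain callables, and the "LLM score" is replaced by a toy character-based heuristic so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Region:
    """One segmented piece of the page (output of step 1, assumed)."""
    kind: str     # "text", "table", "formula", or "image"
    payload: str  # stand-in for the cropped pixels of this region


def plausibility(text: str) -> float:
    """Toy stand-in for an LLM score (assumption, not a real model):
    the fraction of characters that are letters, digits, or whitespace."""
    if not text:
        return 0.0
    ok = sum(ch.isalnum() or ch.isspace() for ch in text)
    return ok / len(text)


def read_text(
    region: Region,
    primary_ocr: Callable[[str], str],
    fallback_ocr: Callable[[str], str],
    threshold: float = 0.8,
) -> str:
    """Step 2: run the primary OCR, and retry with the fallback model
    (e.g. a ViT-based one) if the result looks implausible."""
    guess = primary_ocr(region.payload)
    if plausibility(guess) < threshold:
        guess = fallback_ocr(region.payload)
    return guess


def process(
    regions: List[Region],
    handlers: Dict[str, Callable[[Region], str]],
) -> List[str]:
    """Steps 2-5: send each region to the handler registered for its kind."""
    return [handlers[region.kind](region) for region in regions]
```

In a real setup, `plausibility` would be something like a perplexity from a language model, and the handlers for `"table"`, `"formula"`, and `"image"` would wrap whatever ViT/CNN models you pick.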