| HN Mirror

by jszymborski 678 days ago

Maybe a pipeline like:

1. Segment the document: identify which parts are text, images, formulas, tables, etc.

2. For text, do OCR + LLM. You can use an LLM to score how plausible the recognized text is, and if it is way off, fall back to a ViT or something similar for OCR.

3. For tables, you can get a ViT/CNN to identify the cells and recover positional information, then use OCR + LLM to recover the contents of the cells.

4. For formulas (and formulas in tables), just use a ViT/CNN.

5. For images, you can get a captioning ViT/CNN to caption the photo, if that's desired.
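A minimal sketch of how the steps above could be wired together, assuming segmentation has already produced typed regions. Everything here is hypothetical: `Region`, `ocr_with_fallback`, `run_pipeline`, and the handler callables are illustrative names, and `toy_surprise` is a crude stand-in for an LLM plausibility score (it just measures the fraction of characters that are neither letters nor spaces), not a real language model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Region:
    kind: str    # "text", "table", "formula" or "image" (output of step 1)
    pixels: str  # placeholder for the cropped image data


def toy_surprise(text: str) -> float:
    """Stand-in for an LLM score: share of characters that are not letters or spaces."""
    if not text:
        return 1.0
    odd = sum(1 for ch in text if not (ch.isalpha() or ch.isspace()))
    return odd / len(text)


def ocr_with_fallback(
    region: Region,
    fast_ocr: Callable[[Region], str],
    slow_ocr: Callable[[Region], str],
    surprise: Callable[[str], float],
    max_surprise: float,
) -> str:
    """Step 2: keep the cheap OCR result unless the scorer finds it implausible."""
    text = fast_ocr(region)
    if surprise(text) > max_surprise:
        # Only pay for the slower model (e.g. a ViT) on suspicious output.
        return slow_ocr(region)
    return text


def run_pipeline(
    regions: List[Region],
    handlers: Dict[str, Callable[[Region], str]],
) -> List[str]:
    """Steps 2 to 5: send each segmented region to the handler for its kind."""
    results = []
    for region in regions:
        if region.kind not in handlers:
            raise ValueError(f"no handler for region kind {region.kind!r}")
        results.append(handlers[region.kind](region))
    return results
```

The fallback only triggers on output the scorer dislikes, so the expensive model runs on the edge cases rather than on every region.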

2 comments

ozim 678 days ago

I don't see how you make an LLM improve tables, where most of the time a cell is a single word or a single value that doesn't have continuous context like a sentence does.

jszymborski 678 days ago

IMHO, the LLM correction is most relevant/useful in the edge cases rather than the modal ones, so I totally agree.

refulgentis 678 days ago

They take images

troysk 678 days ago

How do you segment the document without an LLM?

I prefer to do all of this in one step with an LLM, a good prompt, and a few shots.

With so many passes over the images, the cost and time will be high, with ViT being slower.

jszymborski 678 days ago

Segmenting can likely be done at a really low resolution with a CNN, making it really quick.

There are some heuristic ways of doing it, but I doubt you'll be able to distinguish equations from text.

troysk 676 days ago

Segmenting at a lower resolution and then applying the segments at a higher resolution using resolution multipliers doesn't work, as other items bleed in. The FastSAM paper has some interesting ideas on doing this with CNNs, which I guess SAM2 has superseded. However, the complication in the pipeline is not worth the result, as I find vision LLMs are able to do almost the same task within the same OCR prompt.
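One plausible reading of the bleed-in problem, sketched with plain box arithmetic: a box found at low resolution is only accurate to one low-res pixel, which is `scale` pixels at full resolution, so padding the upscaled crop to avoid clipping pulls in the neighboring region. The helper names (`upscale_box`, `pad_box`, `overlaps`) and the example coordinates are assumptions made up for illustration, not taken from FastSAM or SAM2.

```python
from typing import Tuple

# (left, top, right, bottom); right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def upscale_box(box: Box, scale: int) -> Box:
    """Map a box found on the downsampled page back to full resolution."""
    left, top, right, bottom = box
    return (left * scale, top * scale, right * scale, bottom * scale)


def pad_box(box: Box, pad: int, width: int, height: int) -> Box:
    """Grow a box by `pad` pixels on every side, clamped to the page size."""
    left, top, right, bottom = box
    return (
        max(0, left - pad),
        max(0, top - pad),
        min(width, right + pad),
        min(height, bottom + pad),
    )


def overlaps(a: Box, b: Box) -> bool:
    """True if the two boxes share at least one pixel."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


# Hypothetical example: a text line directly above a formula, segmented at 1/8 scale.
SCALE = 8
text_hi = upscale_box((10, 5, 40, 7), SCALE)
formula_hi = upscale_box((10, 7, 40, 10), SCALE)

# Padding by one low-res pixel of uncertainty makes the text crop reach into the formula.
text_padded = pad_box(text_hi, SCALE, width=800, height=600)
```

Here the raw upscaled boxes only touch, but the padded text crop overlaps the formula box, so part of the formula would end up in the text OCR input.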

wahnfrieden 678 days ago

Apple APIs such as Live Text, subject identification, and Vision. You can run them on a server, too.
