by authorfly 746 days ago

What format? Is the entire data table in one image, or is it, for example, a printed PDF of 8 pages where the user chose to put the header only on the first page? Or is it decently formatted, font size 8+, in an image with decent resolution? With the latter you are probably fine, although you will need some manual implementation for parsing the output. You get bounding boxes at word level. One thing I would do if I started nowadays is use basic columns (x coordinates) to add '|' between the outputs (including detecting empty span positions), keep items with similar y coordinates together on lines, and put the result into ChatGPT to format as desired. I suspect this would avoid misreading.
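A rough sketch of that column/row idea, in case it helps. It assumes you already have word-level boxes from your OCR engine; the `Word` class, the function names, and the `y_tol` / `min_gap` thresholds are all made up for illustration and would need tuning per document.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical word box: text plus left/right edges and vertical centre."""
    text: str
    x0: float
    x1: float
    y: float


def group_rows(words, y_tol):
    """Group words whose vertical centres are within y_tol of the previous word."""
    rows = []
    for w in sorted(words, key=lambda w: w.y):
        if rows and abs(w.y - rows[-1][-1].y) <= y_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    return rows


def column_spans(words, min_gap):
    """Merge the x-intervals of all words; a horizontal gap of at least
    min_gap starts a new column span."""
    spans = []
    for x0, x1 in sorted((w.x0, w.x1) for w in words):
        if spans and x0 - spans[-1][1] < min_gap:
            spans[-1][1] = max(spans[-1][1], x1)
        else:
            spans.append([x0, x1])
    return spans


def to_pipe_table(words, y_tol=5.0, min_gap=15.0):
    """Emit one line per row with ' | ' between columns; a column with no
    word in that row becomes an empty cell."""
    cols = column_spans(words, min_gap)
    lines = []
    for row in group_rows(words, y_tol):
        cells = [[] for _ in cols]
        for w in sorted(row, key=lambda w: w.x0):
            centre = (w.x0 + w.x1) / 2
            # Every word interval lies inside exactly one merged span.
            idx = next(i for i, (c0, c1) in enumerate(cols) if c0 <= centre <= c1)
            cells[idx].append(w.text)
        lines.append(" | ".join(" ".join(c) for c in cells))
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        Word("Address", 10, 60, 100), Word("Price", 120, 160, 101),
        Word("Beds", 220, 250, 99),
        Word("12", 10, 25, 130), Word("Oak", 30, 55, 131),
        Word("St", 58, 70, 130), Word("3", 220, 230, 129),
    ]
    print(to_pipe_table(sample))
```

The second sample row has no price, so it comes out with an empty cell between two pipes rather than shifting "3" left, which is the point of detecting empty span positions. Columns are found globally here, so a page with several differently shaped tables would need to be split first.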

I would say PaddleOCR is good in general for tables. It's much better (in terms of recall rate) at recognising numerical digits and symbols than Tesseract, although I notice it often misrecognises the "l" in "Lullaby/ml/million" etc. as "1".
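For the "l" read as "1" problem, one cheap post-processing pass might look like the sketch below. This is not something PaddleOCR provides; the function name and the vocabulary are hypothetical, and it only swaps "1" back to "l" when the result is a word you expect, so things like "A1" or "1,250" are left alone.

```python
def fix_one_for_l(token, vocab):
    """Replace '1' with 'l' in a token made only of letters and '1's,
    but only if the corrected token is in the expected vocabulary."""
    has_one = "1" in token
    has_letter = any(c.isalpha() for c in token)
    only_letters_and_ones = all(c.isalpha() or c == "1" for c in token)
    if has_one and has_letter and only_letters_and_ones:
        candidate = token.replace("1", "l")
        if candidate.lower() in vocab:
            return candidate
    return token


# Hypothetical vocabulary: words you expect to see in your documents.
VOCAB = {"ml", "million", "lullaby"}

if __name__ == "__main__":
    for tok in ["m1", "mi11ion", "Lu11aby", "A1", "1,250"]:
        print(tok, "->", fix_one_for_l(tok, VOCAB))
```

The vocabulary check is the design choice that matters: without it, any mixed token such as a unit number or grid reference would be silently rewritten.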

The cloud providers have better table extraction iff you can guarantee the same format each time for the document.

cpursley 746 days ago

A wide variety of PDFs (both in length and content) that can contain a variety of different tables; they are real-estate related, with a lot of financial content. And I need to be able to run on local models / software (no parsing as a service, no OpenAI, etc.).

Here's just one example: https://www.totalflood.com/samples/residential.pdf (I struggle to get accurate data out of the Sales Comp section; basically all approaches mix up the properties).

authorfly 739 days ago

Sorry, this will be very hard to do. You can't really segment the images based on lines, as the tables probably vary. The floor plans and things... this data is very, very challenging.

I would suggest your best bet is waiting 2 years for the next version of LLaVA to come out, which may be able to interpret documents like this very accurately on device. Progress with LLaVA has been fast recently, but for now it's still a bit too inaccurate.