| HN Mirror


by codetrotter 499 days ago

Many moons ago I was tasked with extracting data from a bunch of PDFs. I made a tool to visualise how characters were laid out on the page and bounding boxes of all the elements.

The project was in the end a complete failure and several people were upset at me for not delivering what I was supposed to.

These days, with what LLMs can now do to extract data from PDFs, I would 100% go the route of using AI to extract the data they wanted. Back then that did not yet exist.

7 comments

bob1029 499 days ago

Parsing data out of arbitrary PDFs is a cursed mission. PDF can contain images, so you might as well target JPEG directly.

OCR can take you pretty far depending on expectations, but it's never quite far enough in my experience.


themanmaran 499 days ago

That's been our experience as well: just scrapping any of the metadata associated with the PDF and treating it like an image, since you never know when a document has a screenshot of an Excel table inside.

The .NORM files (https://xkcd.com/2116)


jimjimjim 499 days ago

The LLMs might help with sequencing the characters you extract from the page, but actually getting the contents is still difficult. A number of times I've come across a page where the letters of the text are glyphs in a custom font with no mapping to ASCII or anything similar. Even more common, especially with output from CAD, are letters that are made by drawing lines in the shape of letters, so there is nothing identifiable to extract and you are left with OCRing the page to double-check the results.


macklinkachorn 499 days ago

In my previous role I experienced similar things: the rule-based parsing approach is really tricky to get right and often failed on edge cases.

We (at https://runtrellis.com/) have been building a PDF processing pipeline from the ground up with LLMs and VLMs and have seen close to 100% accuracy even for tricky PDFs. The key is to use a rule-based engine and references to cross-check the data.
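To make the cross-checking idea concrete, here is a minimal sketch of what a rule-based check on model output could look like. It is an illustration only, not the Trellis pipeline: the function name `cross_check`, the field names (`invoice_number`, `total`, `line_items`) and the two rules are all assumptions.

```python
import re
from decimal import Decimal


def cross_check(extracted, page_text):
    """Return a list of problems found in model-extracted fields.

    `extracted` is a hypothetical dict produced by an LLM/VLM step;
    `page_text` is the reference text (text layer or OCR output).
    """
    # Collapse whitespace so layout differences do not cause false mismatches.
    normalised = re.sub(r"\s+", " ", page_text)
    problems = []

    # Rule 1: key values must appear verbatim in the reference text.
    for field in ("invoice_number", "total"):
        if str(extracted[field]) not in normalised:
            problems.append(f"{field} not found in source text")

    # Rule 2: line item amounts must add up to the stated total.
    items_sum = sum(Decimal(amount) for amount in extracted["line_items"])
    if items_sum != Decimal(extracted["total"]):
        problems.append("line items do not sum to total")

    return problems


page = "Invoice INV-001\nWidget 10.00\nGadget 20.00\nTotal   30.00"
good = {"invoice_number": "INV-001", "total": "30.00", "line_items": ["10.00", "20.00"]}
bad = {"invoice_number": "INV-001", "total": "35.00", "line_items": ["10.00", "20.00"]}
print(cross_check(good, page))
print(cross_check(bad, page))
```

Anything that comes back with problems can be routed to a retry or a human instead of being trusted blindly.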


spacecadet 498 days ago

Many moons ago worked on extracting 2D CAD drawings from PDFs and converting to full 3D. Fun times.


gsempe 494 days ago

I’m very interested in whether you managed to make it work?


spacecadet 489 days ago

Yes and no. Had an initial prototype but ran into issues with all of the edge cases related to document formatting and detail, and eventually abandoned it.


rad_gruchalski 498 days ago

pdfjs does all of that and it’s pretty solid. I used it recently to extract tabular data out of a 10-year batch of bank statements.
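For anyone wondering how positioned text turns into a table, here is a rough sketch of the row-grouping step. It does not call pdfjs; it assumes the extraction step has already produced text fragments with page coordinates, and the item shape (`x`, `y`, `text`), the function name `group_into_rows` and the tolerance value are hypothetical.

```python
def group_into_rows(items, y_tolerance=2.0):
    """Group positioned text fragments into table rows.

    Assumes PDF-style coordinates (origin at the bottom left), so a larger
    `y` means higher up the page. Fragments whose `y` is within
    `y_tolerance` of the first fragment in a row are treated as the same row.
    """
    rows = []
    # Walk the fragments from the top of the page to the bottom.
    for item in sorted(items, key=lambda it: -it["y"]):
        if rows and abs(rows[-1][0]["y"] - item["y"]) <= y_tolerance:
            rows[-1].append(item)
        else:
            rows.append([item])
    # Within each row, order the cells left to right.
    return [[it["text"] for it in sorted(row, key=lambda it: it["x"])] for row in rows]


items = [
    {"x": 200, "y": 700.5, "text": "-12.50"},
    {"x": 50, "y": 700, "text": "2015-01-03"},
    {"x": 120, "y": 700, "text": "Coffee"},
    {"x": 50, "y": 685, "text": "2015-01-04"},
    {"x": 200, "y": 685, "text": "1000.00"},
    {"x": 120, "y": 684.6, "text": "Salary"},
]
for row in group_into_rows(items):
    print(row)
```

Real statements usually need more than this (wrapped descriptions, page headers, columns that shift between years), so treat it as a starting point.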


GaggiX 499 days ago

It reminds me of: https://xkcd.com/1425/

In the same way, with today's AI models the task is now easily achievable.


aboardRat4 498 days ago

mathpix does quite an awesome job actually
