| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by kwon-young 317 days ago

Just to illustrate this point, poppler [1] (which is the most popular pdf renderer in open source) has a little tool called pdf2cairo [2] which can render a pdf into a svg. This means you can delegate all pdf rendering to poppler and only work with actual graphical objects to extract semantics.

I think the reason this method is not popular is that there are still many ways to encode a semantic object graphically. A sentence can be broken down into words or letters. Table lines can be formed from multiple smaller lines, etc. But, as mentioned by the parent, rule based systems works reasonably well for reasonably focused problems. But you will never have a general purpose extractor since rules needs to be written by humans.

[1] https://poppler.freedesktop.org/ [2] https://gitlab.freedesktop.org/poppler/poppler/-/blob/master...

2 comments

throwaway4496 317 days ago

There is also PDF to HTML, PDF to Text, MuPDF also has PDF to XML, both projects along with a bucketful of other PDF toolkits have PDF to PS, and there is many many XML, HTML, and Text outputs for PS.

Rastering and OCR'ing PDF is like using regex to parse XHTML. My eyes are starting to bleed out, I am done here.

link

ramraj07 317 days ago

It looks like you make a lot of valid points, but also have an extremely visceral reaction because theres a company out there thats using AI in a way that offends you. I mean fair still.

But im a guy who's in the market for a pdf parser service, im happy to pay pretty penny per page processed. I just want a service that works without me thinking for a second about any of the problems you guys are all discussing. What service do I use? Do I care if it uses AI in the lamest way possible? The only thing that matters are the results. There are two people including you in this thread ramming with pdf parsing gyan but from reading it all, it doesn't look like I can do things the right way without spending months fully immersed in this problem alone. If you or anyone has a non blunt AI service that I can use Ill be glad to check it out.

link

throwaway4496 316 days ago

It is a hard problem, yes, but you don't solve it by rastering it, OCR, and then using AI. You render it into a structured format. Then at least you don't have to worry about hallucinations, fancy fonts OCR problems, text shaping problems, huge waste of GPU and CPU to paint an image only to OCR it and throw it away.

Use a solution that renders PDF into structured data if you want correct and reliable data.

link

anthk 317 days ago

pdftotext from poppler has that without doing juggling with formats.

link