| HN Mirror


by petesergeant 319 days ago

> instead of just using the "quality implementation" to actually get structured data out?

I suggest spending a few minutes using a PDF editor program with some real-world PDFs, or even just copying and pasting text from a range of different PDFs. These files are made up of cute tricks and hacks that whatever produced them used to make something that visually works. The high-quality implementations just put the pixels where they're told to. The underlying "structured data" is a lie.

EDIT: I see from further down the thread that your experience of PDFs comes from programmatically generated invoice templates, which may explain why you think this way.


joakleaf 319 days ago

We do a lot of parsing of PDFs and basically break the structure into 'letter with font at position (box)' because the "structure" within the PDF is unreliable.

We have algorithms that combine the individual letters into words, words into lines, and lines into boxes, all by looking at them geometrically. Obviously, that includes identifying the spaces between words.
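A minimal sketch of what such geometric grouping could look like, not the implementation described in this comment. The `Glyph` type, the function names, and both thresholds (`min_overlap`, `gap_ratio`) are assumptions made for illustration; real documents need more care (rotated text, columns, varying font sizes).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    """One letter with its bounding box in PDF coordinates (y grows upward)."""
    char: str
    x0: float  # left
    y0: float  # bottom
    x1: float  # right
    y1: float  # top


def group_into_lines(glyphs: List[Glyph], min_overlap: float = 0.5) -> List[List[Glyph]]:
    """Cluster glyphs into lines by vertical overlap with each line's first glyph.

    min_overlap is a hypothetical threshold: the fraction of the smaller
    glyph height that must overlap for two glyphs to share a line.
    """
    lines: List[List[Glyph]] = []
    # Visit glyphs from the top of the page downward.
    for g in sorted(glyphs, key=lambda g: -g.y1):
        for line in lines:
            ref = line[0]
            overlap = min(g.y1, ref.y1) - max(g.y0, ref.y0)
            height = min(g.y1 - g.y0, ref.y1 - ref.y0)
            if height > 0 and overlap / height >= min_overlap:
                line.append(g)
                break
        else:
            lines.append([g])
    # Within each line, order glyphs left to right.
    for line in lines:
        line.sort(key=lambda g: g.x0)
    return lines


def line_to_text(line: List[Glyph], gap_ratio: float = 0.25) -> str:
    """Join glyphs, inserting a space where the horizontal gap is large.

    gap_ratio is a hypothetical threshold: a gap wider than this fraction
    of the previous glyph's height counts as a word break.
    """
    out = [line[0].char]
    for prev, cur in zip(line, line[1:]):
        gap = cur.x0 - prev.x1
        if gap > gap_ratio * (prev.y1 - prev.y0):
            out.append(" ")
        out.append(cur.char)
    return "".join(out)


def extract_text(glyphs: List[Glyph]) -> List[str]:
    """Return one string per detected line, top to bottom."""
    return [line_to_text(line) for line in group_into_lines(glyphs)]
```

Note that the spaces in the output are inferred purely from the gaps between boxes; the sketch never relies on space characters stored in the file.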

We handle hidden text and problematic glyph-to-unicode tables.

The output is similar to OCR except we don't do the rasterization and quality is higher because we don't depend on vision based text recognition.

I wrote the base implementation of all this in less than a month 10 years ago, and we rarely, if ever, touch it.

We do machine learning afterwards on the structure output too.


petesergeant 319 days ago

Very interesting. How often do you encounter PDFs that are just scanned pages? I had to make heavy use of pdfsandwich last time I was accessing journal articles.

> quality is higher because we don't depend on vision based text recognition

This surprises me a bit; outside of an actual scan leaving the computer I’d expect PDF->image->text in a computer to be essentially lossless.


joakleaf 319 days ago

This happens, as do variants that have been processed with OCR.

If it is just scanned, it contains a single image and no text.

OCR programs will commonly create a PDF where the text/background and detected images are separate. The OCR program then inserts transparent (no-draw) letters in place of the text it has identified, or (less frequently) places the letters behind the scanned image in the PDF (i.e. with lower z).

We can detect if something has been generated by an OCR program by looking at the "Creator data" in the PDF that describes the program used to create the PDF. So we can handle that differently (and we do handle that a little bit differently).
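As a rough sketch of that kind of check: a PDF's document information dictionary has `/Creator` and `/Producer` entries, and a simple substring match against them could flag OCR output. The function name and the list of tool names below are assumptions for illustration, not the list used in practice, and checking `/Producer` as well is an addition not mentioned above.

```python
from typing import Mapping, Optional

# Hypothetical, incomplete list of substrings that suggest an OCR tool.
OCR_TOOL_HINTS = ("ocrmypdf", "tesseract", "pdfsandwich", "finereader")


def looks_ocr_generated(info: Mapping[str, Optional[str]]) -> bool:
    """Guess whether a PDF came from an OCR program.

    `info` is the document information dictionary as a plain mapping,
    e.g. {"/Creator": "...", "/Producer": "..."}; how it is read from the
    file depends on the PDF library in use.
    """
    fields = (info.get("/Creator"), info.get("/Producer"))
    # Missing or empty fields are skipped; matching is case-insensitive.
    haystack = " ".join(f for f in fields if f).lower()
    return any(hint in haystack for hint in OCR_TOOL_HINTS)
```

A match would only be a hint to route the file through different handling, such as ignoring the invisible text layer; metadata can be missing or rewritten by later tools.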

PDF->image->text is 100% not lossless.

When you rasterize the PDF, you lose information because you are going from a resolution-independent format to a specific resolution:

• Text must be rasterized into letters at the target resolution
• Images must be resampled at the target resolution
• Vector paths must be rasterized to the target resolution

So for example the target resolution must be high enough that small text is legible.
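To put numbers on that: a PDF point is 1/72 inch, so the pixel height available to a glyph scales linearly with the raster DPI. The sketch below uses the nominal font size as a stand-in for glyph height (actual letters are smaller), and the 20-pixel legibility target in the usage note is an arbitrary assumption, not a known OCR requirement.

```python
POINTS_PER_INCH = 72.0  # one PDF point is 1/72 inch


def glyph_height_px(font_size_pt: float, dpi: float) -> float:
    """Approximate pixel height of text of a given point size at a given DPI."""
    return font_size_pt / POINTS_PER_INCH * dpi


def min_dpi_for(font_size_pt: float, min_px: float) -> float:
    """DPI needed so that text of this size is at least `min_px` pixels tall."""
    return min_px * POINTS_PER_INCH / font_size_pt
```

For example, 6 pt footnote text gets only about 6 pixels of height at 72 DPI and about 25 at 300 DPI, and `min_dpi_for(6, 20)` suggests rasterizing at 240 DPI or more if 20 pixels were the target.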

If you perform OCR, you depend on the ability of the OCR program to accurately identify the letters based on the rasterized form.

OCR is not 100% accurate, because it is a computer vision recognition problem:

• There are hundreds of thousands of fonts in the wild, each with different details and appearances.
• Two letters can look the same. A simple example where trivial OCR/recognition fails is capital "I" and lower-case "l": both are vertical lines, so you need the context (letters nearby). The same goes for "O" and zero.
• OCR is also pretty hopeless with e.g. headlines or text written on top of images, because it is hard to distinguish letters from the background. But even regular black-on-white text fails sometimes.
• OCR will also commonly identify "ghost" letters in images that are not really there, i.e. it picks up a bunch of pixels and detects them as a letter when they are really just some pixel structure in the image (not even necessarily text in the image). This is a form of hallucination.
