| HN Mirror

This happens -- also variants which have been processed with OCR.

So if it is scanned it contains just a single image - no text.

OCR programs will commonly create a PDF where the text/background and detected images are separate. And then the OCR program inserts transparent (no-draw) letters in place of the text it has identified, or (less frequently) place the letters behind the scanned image in the PDF (i.e. with lower z).

We can detect if something has been generated by an OCR program by looking at the "Creator data" in the PDF that describes the program use to create the PDF. So we can handle that differently (and we do handle that a little bit differently).

PDF->image->text is 100% not lossless.

When you rasterize the PDF, you losing information because you are going from a resolution independent format to a specific resolution: • Text must be rasterized into letters at the target resolution • Images must be resampled at the target resolution • Vector paths must be rasterized to the target resolution

So for example the target resolution must be high enough that small text is legible.

If you perform OCR, you depend on the ability of the OCR program to accurately identify the letters based on the rasterized form.

OCR is not 100% accurate, because it is computer vision recognition problem, and • there are hundrends of thousands of fonts in the wild each with different details and appearances. • two letters can look the same; simple example where trivial OCR/recognition fails is capital letter "I" and lower case "l". These are both vertical lines, so you need the context (letters nearby). Same with "O" and zero. • OCR is also pretty hopeless with e.g. headlines/text written on top of images because it is hard to distinguish letters from the background. But even regular black on white text fails sometimes. • OCR will also commonly identify "ghost" letters in images that are not really there. I.e. pick up a bunch of pixels that have been detected as a letter, but really is just some pixel structure part of the image (not even necessarily text on the image) -- A form of hallucination.