
by aidos 1016 days ago

This topic comes up periodically as most people think PDFs are some impenetrable binary format, but they’re really not.

They are a graph of objects of different types. The types themselves are well described in the official spec (I’m a sadist, I read it for fun).

My advice is always to convert the pdf to a version without compressed data like the author here has. My tool of choice is mutool (mutool clean -d in.pdf out.pdf). Then just have a rummage. You’ll be surprised by how much you can follow.
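
For anyone who wants to script the rummage, here is a minimal sketch. It assumes mutool is on your PATH and only wraps the exact command quoted above; the helper names and the tiny sample are made up for illustration. Once streams are decompressed, a crude scan for /Type entries already gives a feel for the object graph.

```python
import re
import shutil
import subprocess
from collections import Counter


def build_clean_command(src, dst):
    # Same invocation as in the comment: mutool clean -d in.pdf out.pdf
    return ["mutool", "clean", "-d", src, dst]


def decompress_pdf(src, dst):
    # Returns False (instead of raising) when mutool is not installed.
    if shutil.which("mutool") is None:
        return False
    subprocess.run(build_clean_command(src, dst), check=True)
    return True


def count_object_types(pdf_bytes):
    # Crude scan: counts every "/Type /Name" pair found in the raw bytes.
    # This is a quick look, not a real PDF parser.
    names = re.findall(rb"/Type\s*/([A-Za-z]+)", pdf_bytes)
    return Counter(name.decode("ascii") for name in names)


# Hypothetical, hand-written fragment standing in for a decompressed file.
SAMPLE = (
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n"
    b"3 0 obj\n<< /Type /Page /Parent 2 0 R >>\nendobj\n"
)

print(count_object_types(SAMPLE))
```

On a real file you would call decompress_pdf first and then feed the output file's bytes to count_object_types.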

In the article the author missed a step where you look at the page object to see the resources. That’s where the mapping from the font name used in the content stream to the underlying object is made.
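
A rough sketch of that lookup, assuming a decompressed file where the page's /Resources and /Font dictionaries are written inline (in real files either can be an indirect reference, which this does not follow). The sample page object and the function name are invented for illustration.

```python
import re

# Hypothetical page object as it might appear after decompression.
SAMPLE_PAGE = b"""3 0 obj
<< /Type /Page /Parent 2 0 R
   /Resources << /Font << /F1 5 0 R /F2 7 0 R >> >>
   /Contents 4 0 R >>
endobj
"""


def font_resources(page_obj):
    # Find the inline /Font dictionary inside the page object.
    match = re.search(rb"/Font\s*<<(.*?)>>", page_obj, re.S)
    if match is None:
        return {}
    # Each entry looks like "/F1 5 0 R": name, object number, generation.
    pairs = re.findall(rb"/(\w+)\s+(\d+)\s+(\d+)\s+R", match.group(1))
    return {name.decode("ascii"): (int(num), int(gen)) for name, num, gen in pairs}


print(font_resources(SAMPLE_PAGE))
```

So when the content stream says "/F1 12 Tf", this table tells you that F1 is object 5, and that object is where you go looking for the font's encoding and ToUnicode entries.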

There’s also another important bit missing - most fonts are subset into the PDF. I.e., only the glyphs that are needed are kept in the font. I think that’s often where the re-encoding happens. ToUnicode is maintained to allow you to copy text (or search in a PDF). It’s a nice-to-have for users (in my experience it’s normally there and correct though).
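
To make the ToUnicode part concrete, here is a small sketch that reads the bfchar entries from a CMap and uses them to turn re-encoded glyph codes back into text. The sample CMap is hand-written, the codes are assumed to be single bytes, and bfrange entries are deliberately ignored, so treat it as an illustration rather than a complete decoder.

```python
import re

# Hypothetical ToUnicode CMap for a subset font with three glyphs.
SAMPLE_CMAP = """/CIDInit /ProcSet findresource begin
begincmap
1 begincodespacerange
<00> <FF>
endcodespacerange
3 beginbfchar
<01> <0048>
<02> <0069>
<03> <0021>
endbfchar
endcmap
"""


def parse_bfchar(cmap_text):
    # Map each source code to the Unicode string it stands for.
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", block):
            # Destination values are UTF-16BE code units written in hex.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def decode(codes, mapping):
    # Assumes one-byte codes; unknown codes become U+FFFD.
    return "".join(mapping.get(code, "\ufffd") for code in codes)


mapping = parse_bfchar(SAMPLE_CMAP)
print(decode(b"\x01\x02\x03", mapping))
```

The bytes in the content stream (01 02 03 here) mean nothing on their own after subsetting; the CMap is what makes copy and search work.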

azangru 1016 days ago

> I’m a sadist, I read it for fun.

I think this is called masochist. Now, if you participated in writing the spec or were making others read it...

aidos 1016 days ago

Yup, slip of the tongue. Though, I do make other people read the spec at work, so I’m that too.

blur13 1015 days ago

a sadist is a masochist who follows the golden rule

esafak 1016 days ago

It is a shame Adobe designed a format so hard to work with that people are amazed when someone accomplishes what should be a basic task with it.

Their design philosophy of creating a read-only format was flawed to begin with. What's the first feature people are going to ask for??

pwg 1016 days ago

> It is a shame Adobe designed a format so hard to work with

PDF was not designed to be editable, nor for anyone to "work with" it in that way.

It was designed (at least the original purpose circa 1989) to represent printed pages electronically in a format that would view and print identically everywhere. In fact, the initial advertising for the "value" of the PDF format was exactly this: no matter where a recipient viewed your PDF output, it would look, and print, the same as everywhere else.

It was originally meant to be "electronic paper".

dylan604 1016 days ago

Wasn't the PDF format based on the Illustrator format?

The weird thing to me is people using a distribution format as an original source. It's right up there with video cameras shooting an acquisition source as an MP4 and all of the negative baggage that comes with that.

mistrial9 1016 days ago

1.4.4 Portable Document Format (PDF) Adobe has specified another format, PDF, for portable representation of electronic documents. PDF is documented in the Portable Document Format Reference Manual. PDF and the PostScript language share the same underlying Adobe imaging model. A document can be converted straightforwardly between PDF and the PostScript language; the two representations produce the same output when printed. However, PDF lacks the general-purpose programming language framework of the PostScript language. A PDF document is a static data structure that is designed for efficient random access and includes navigational information suitable for interactive viewing.

-- https://www.adobe.com/jp/print/postscript/pdfs/PLRM.pdf

j45 1011 days ago

This is a very valuable link - just generate PDFs yourself by hand or script.

Denvercoder9 1016 days ago

> The weird thing to me is people using a distribution format as an original source.

Every distribution format will inevitably end up being used as a source, as the originals get lost in the mists of time.

userbinator 1016 days ago

I believe Illustrator format is very similar to PostScript.

mistrial9 1016 days ago

.. waves to Leonard Rosenthol

gobdovan 1016 days ago

If you find pleasure in something that gives you pain, you're a masochist. A sadist likes inflicting pain onto others. Since you seem to like helping people I'd say it's more likely you're the former. I appreciate the mutool advice!

haolez 1016 days ago

That's awesome. I'm relying a lot on Amazon Textract for my PDF parsing needs.

Do you have any other insights on how to do a good job at that natively, i.e. without a cloud provider? Especially when dealing with tables.

kccqzy 1016 days ago

PDF format does not give you enough semantic information to understand there is a table. The stream contains instructions such as moving to a coordinate, adding some text, adding some lines. No tool can extract tables with 100% precision.
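
A small sketch of what that looks like in practice. The stream below is hand-written and the tokenizer is deliberately simplistic (no nested parentheses, hex strings, arrays, or operators containing digits), but it shows the point: operands followed by an operator, and nothing that says "table".

```python
import re

# Hypothetical decompressed content stream: two bits of text and one line.
SAMPLE_STREAM = """BT
/F1 12 Tf
72 700 Td
(Name) Tj
100 0 Td
(Qty) Tj
ET
72 690 m
300 690 l
S
"""

# Literal strings, names, numbers, then operators, tried in that order.
TOKEN = re.compile(
    r"\((?:\\.|[^\\()])*\)|/[^\s/()<>\[\]]+|[-+]?\d*\.?\d+|[A-Za-z'\"*]+"
)


def parse_ops(stream):
    # Content streams are postfix: operands pile up until an operator appears.
    ops, operands = [], []
    for tok in TOKEN.findall(stream):
        if tok[0] == "(":
            operands.append(tok[1:-1])
        elif tok[0] == "/":
            operands.append(tok)
        elif tok[0] in "+-.0123456789":
            operands.append(float(tok))
        else:
            ops.append((tok, operands))
            operands = []
    return ops


for op, args in parse_ops(SAMPLE_STREAM):
    print(op, args)
```

Td moves the text position, Tj shows a string, and m / l / S draw a stroked line. Whether that line is a table border, an underline, or decoration is left for the reader (or the tool) to guess.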

haolez 1015 days ago

Yeah, but Textract uses OCR/computer vision even in PDFs with embedded text data and it can extract tables incredibly well. I believe there isn't an open source equivalent. Maybe some advanced usage of tesseract?

aidos 1015 days ago

This seems to have stalled but it has popped up a few times on HN in the past. Might still be worth a look.

https://github.com/tabulapdf/tabula

Are the documents scans, or do they have real text in them? It’s worth trying to convert them to SVG or HTML using “mutool convert” and then seeing what you can do with the results. If you’re dealing with the same type of document each time you’ll probably find the patterns in there are common enough that you can easily grab what you want.
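
As an illustration of the kind of pattern I mean: once you have text fragments with coordinates (however you pulled them out), grouping them by vertical position often gets you most of the way to rows. This is only a heuristic sketch; the fragment data, the tolerance, and the function name are made up, and it assumes PDF-style coordinates where y grows upward (flip the sort for SVG output).

```python
def group_rows(fragments, y_tolerance=2.0):
    # fragments: iterable of (x, y, text). Fragments whose y values are
    # within y_tolerance of the row's first fragment share a row.
    rows = []
    for x, y, text in sorted(fragments, key=lambda f: (-f[1], f[0])):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Within each row, order cells left to right.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Hypothetical fragments from a two-column table with slightly uneven baselines.
FRAGMENTS = [
    (72, 700, "Name"),
    (200, 700.5, "Qty"),
    (72, 686, "Widget"),
    (200, 685.6, "3"),
]

print(group_rows(FRAGMENTS))
```

It falls over on multi-line cells and merged columns, but for the same type of document each time, tuning the tolerance and a few column boundaries is often enough.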