

	by jeremynixon 1308 days ago
	Why is the state of the art in PDF parsing SO BAD? This is an incredibly common and important problem. Tika and fitz have very poor results. What is the reason that this is still so backwards?

7 comments

jahewson 1308 days ago

Despite the thousands of pages of ISO 32000, the reality is that the format is not defined. Acrobat tolerates unfathomably malformed PDF files generated by old software that predates the opening-up of the standard when people were reverse-engineering it. There’s always some utterly insane file that Acrobat opens just fine and now you get to play the game of figuring out how Acrobat repaired it.

Plus all the fun of the fact that you can embed the following formats inside a PDF:

PNG, JPEG (including CMYK), JPEG 2000 (dead), JBIG2 (dead), CCITT G4 (dead, fax machines), PostScript Type1 fonts (dead), PostScript Type3 fonts (dead), PostScript CIDFonts (pre-Unicode, dead), CFF fonts (the inside of an OTF), TrueType fonts, ICC Profiles, PostScript functions defining Color spaces, XML forms (the worst), LZW-compressed data, Run-length compressed data, Deflate-compressed data.

All of which Acrobat will allow to be malformed in various non-standard ways so you need to write your own parsers.

Note the lack of OpenType fonts, also lack of proper Unicode!

userbinator 1308 days ago

JPEG 2000 (dead)

Not sure what you mean by "dead", but tons of book scans, particularly those at archive.org, are PDFs of entirely JPEG2000 images.

joe_guy 1308 days ago

Believe it or not, but digital cinema projection is done with jpeg 2000 https://en.wikipedia.org/wiki/Digital_cinema

jahewson 1308 days ago

That’s wild!

jahewson 1308 days ago

I mean dead as in the fact that it’s used somewhere is noteworthy.

I’d love for JPEG XL to replace such uses!

jwilk 1308 days ago

> lack of proper Unicode

What do you mean?

pwg 1308 days ago

PDF was defined way back before Unicode was ever a thing. It is natively an 8-bit character-set format for text handling. It gets around the limit of only 256 available characters by allowing custom byte-to-glyph mappings (think both ASCII and EBCDIC encoding in different parts of the same document). To typeset a glyph that is not in the 256-character mapping currently in use, you switch to a different custom byte-to-glyph mapping and typeset the character from there.
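To make the mapping switch concrete, here is a minimal sketch in Python. It is not how any real PDF library works: the font names `F1` and `F2`, the two tiny tables, and the `decode_text_runs` helper are all made up for illustration, and real encodings have up to 256 entries each.

```python
# Hypothetical byte-to-glyph tables; only a few of the 256 slots are filled in.
LATIN_FONT = {0x41: "A", 0x42: "B"}
GREEK_FONT = {0x41: "\u0391", 0x42: "\u0392"}  # Greek capital Alpha, Beta

# Hypothetical font resource names, as a page might refer to them.
ENCODINGS = {"F1": LATIN_FONT, "F2": GREEK_FONT}


def decode_text_runs(runs, encodings):
    """Decode (font_name, raw_bytes) pairs using each font's own mapping."""
    out = []
    for font_name, raw in runs:
        table = encodings[font_name]
        # Bytes with no entry become U+FFFD, the Unicode replacement character.
        out.append("".join(table.get(b, "\ufffd") for b in raw))
    return "".join(out)


# The same two bytes mean different characters depending on the active font.
runs = [("F1", b"AB"), ("F2", b"AB")]
decoded = decode_text_runs(runs, ENCODINGS)
print(decoded)
```

The point is that the raw bytes alone tell you nothing; a text extractor has to track which mapping was active for every run.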

jwilk 1308 days ago

> PDF was defined way back before Unicode was ever a thing.

Unicode 1.0 was released in 1991.

PDF 1.0 was released in 1993.

dbrueck 1308 days ago

Others have given some good answers already, but I'll add one more: PDF is all about creating a final layout, and given a final layout, there are an infinite number of inputs that could have produced it. If you are parsing the PDF, most of the time you are trying to get back something higher level than e.g. a command to draw a single character at some X,Y location. But many PDFs in the wild were not generated with that type of semantic extraction in mind, so you have to sort of fudge things to get what you want, and that's where it becomes complex and crazy.

For example, I once had to try to parse PDF invoices generated by some legacy system, and at the bottom was a line that read something like, "Total: $32.56". But in the PDF there was one instruction to write out the string "Total:" and a separate one to write out the amount string, and there was nothing in the PDF itself that correlated the two in any way at all (they didn't appear anywhere close to each other in the page's hierarchy, they weren't at a fixed set of coordinates, etc, etc.).
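One common way to "fudge" this kind of case is to ignore the page's hierarchy and pair strings by where they end up on the page. The sketch below is only an illustration of that idea, not what was actually done for those invoices: `TextRun`, `find_amount_for_label`, the tolerance value, and the sample coordinates are all assumptions, and it presumes some extractor has already produced positioned text runs.

```python
import re
from dataclasses import dataclass


@dataclass
class TextRun:
    """A string plus the page position where it was drawn (hypothetical shape)."""
    text: str
    x: float
    y: float


# Matches amounts such as "$32.56" or "1,204.00".
AMOUNT_RE = re.compile(r"^\$?\d[\d,]*\.\d{2}$")


def find_amount_for_label(runs, label, y_tolerance=3.0):
    """Return the amount drawn nearest to the right of `label` on the same line."""
    labels = [r for r in runs if r.text.strip() == label]
    if not labels:
        return None
    anchor = labels[0]
    candidates = [
        r for r in runs
        if AMOUNT_RE.match(r.text.strip())
        and abs(r.y - anchor.y) <= y_tolerance  # roughly the same baseline
        and r.x > anchor.x                      # to the right of the label
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda r: r.x - anchor.x).text.strip()


# Runs arrive in drawing order, which need not match reading order.
runs = [
    TextRun("$32.56", 410.0, 71.5),
    TextRun("Invoice 1001", 50.0, 700.0),
    TextRun("$10.00", 410.0, 300.0),
    TextRun("Total:", 330.0, 72.0),
]
total = find_amount_for_label(runs, "Total:")
print(total)
```

A heuristic like this breaks as soon as the layout changes (amount on the next line, right-to-left text, rotated pages), which is a fair summary of why this gets complex.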

layer8 1308 days ago

1. PDF is mostly designed as a write-only (or render-only) format. PDF’s original purpose was as a device-independent page output language for printers, because PostScript documents at the time were specific to the targeted printer model. Interpreting a PDF for any other purpose than rendering resembles the task of disassembling machine code back into intelligible source code.

2. Many PDF documents do not conform to the PDF specification in a multitude of ways, yet Adobe Acrobat Reader still accepts them, and so PDF parsers have to implement a lot of kludgy logic in an attempt to replicate Adobe’s behavior.

3. The format has grown to be quite complex, with a lot of features added over the years. Implementing a parser even for spec-compliant PDFs is a decidedly nontrivial effort.

So PDF is a reasonably good output format for fixed-layout pages for display and especially for print, but a really bad input format.

autotune 1308 days ago

My current company uses ML to parse PDF invoices and identify fraud. I have no idea how the devs manage this black magic wizardry because they also spend time contributing to infra code before they hired more people like me on board. If anyone wants a great startup idea, look to solving a problem involving parsing PDFs en masse. Maybe something in legal tech. That market is absolutely ripe for disruption.

bmitc 1308 days ago

Droit does similar things.

autotune 1308 days ago

That is awesome! Can “cross border” do things like process GDPR compliance regulations or is that not the intended use case?

newsclues 1308 days ago

PDF has always seemed to be a janky Adobe product.

Should a modern, open version of PDF be created, knowing how it evolved from the original concept in 1991? Shouldn't we at some point say we need to start over and create PDF2?

jahewson 1308 days ago

That would be XPS https://en.m.wikipedia.org/wiki/Open_XML_Paper_Specification

copperbrick25 1308 days ago

Sadly XPS is not supported by most software, I'd love to use something better than PDF, but even LibreOffice can't export as OXPS.

userbinator 1308 days ago

Anything related to XML is arguably even worse.

mdaniel 1308 days ago

I know it's fun to hate on XML, but as compared to inventing a new pseudo-text-pseudo-binary format, its parsing mechanics are well understood.

I'm not claiming all of PDF's woes are related to its encoding, but it's not zero, either. Start from the fact that XML documents have XML Schema, allowing one to formally specify what can and cannot appear where. The PDF specification is a bunch of English, which makes for shitty constraint boundaries.

manv1 1308 days ago

It was a fight between DiskPaper and PDF. PDF won because the tools were better and it was cross-platform.

And PDF is a subset of PostScript, the product that made Adobe and the DTP industry.

It's janky because the goal was to render identically everywhere. If you think it's easy look at the code abortion that is CSS.

mdaniel 1308 days ago

I know this is likely a case of "you know what I meant," but there already is a PDF 2.0: https://www.pdfa.org/resource/iso-32000-pdf/

steampilot 1308 days ago

I think it's not too late to create a modern open-source alternative to PDF. I find it unacceptable that something that has become so widely used doesn't have proper free tools for editing. People shouldn't be limited by income if they want (have?) to use PDFs, or else suffer from a bad experience. The other, bigger problem with PDF is that a lot of the time it's used for something it wasn't made for. Anything that is expected to be consumed on both mobile and desktop devices should never use PDFs. Government forms should not use PDFs with hacky embedded scripts either.

macintux 1308 days ago

It does seem like that would be a good opportunity to weed out some of the insecure aspects of the format.

Unfortunately in practice it would mean that everyone would have to support both PDF and PDF2.

brailsafe 1308 days ago

It's not that bad. The problem is just big enough in scope that the state of the art is provided by private industry and, to a lesser extent, some open source tools. You're probably way better off joining them than trying to beat them, unless you want to grind your brain into the dust for a page layout spec.

userbinator 1308 days ago

The PDF format is itself a weird hybrid of text and binary.

(I have written a PDF parser myself.)
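A rough sketch of what the text/binary mix looks like in practice, using only Python's standard library. The object below is hand-built for illustration and `read_flate_stream` is a made-up helper: the object header and dictionary are plain ASCII, while the stream body is raw Deflate bytes. It assumes `/Length` is a direct integer; in real files it can be an indirect reference, or simply wrong.

```python
import re
import zlib

# A page content stream: readable drawing operators before compression.
content = b"BT /F1 12 Tf 72 720 Td (Total:) Tj ET"
compressed = zlib.compress(content)

# Text dictionary, then binary payload, then text keywords again.
obj = (
    b"4 0 obj\n"
    b"<< /Length " + str(len(compressed)).encode("ascii") + b" /Filter /FlateDecode >>\n"
    b"stream\n" + compressed + b"\nendstream\nendobj\n"
)


def read_flate_stream(data):
    """Extract and inflate one stream, trusting a direct integer /Length."""
    match = re.search(rb"/Length (\d+)", data)
    length = int(match.group(1))
    # The binary payload starts right after the first "stream" keyword line.
    start = data.index(b"stream\n") + len(b"stream\n")
    return zlib.decompress(data[start:start + length])


recovered = read_flate_stream(obj)
print(recovered.decode("ascii"))
```

You cannot treat the file as text (the payload is arbitrary bytes) and you cannot treat it as a fixed binary layout either (the structure is delimited by keywords and whitespace).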