Hacker News
by lxgr 881 days ago
I use a reflowing PDF reader app for iOS, which works surprisingly well for many PDFs, given that it effectively has to OCR them internally as if they were scanned paper documents.

It has long been my understanding that while PDF can be (and too often is) just pictures of text, PDF can also have embedded PostScript fonts that are rendered upon display from plaintext strings.

Is that not the case?

Yes, but each string to be rendered is one line of text at best (meaning you need to detect line wraps and heuristically distinguish them from paragraph breaks), and often just a single word or even letter at worst (due to line spacing, letter spacing, and kerning).
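To make the line-wrap vs. paragraph-break problem concrete, here is a minimal sketch of that kind of heuristic. It is not what any particular reader app does. The `Fragment` type, its fields, and the thresholds (`y_tol`, `space_gap`, `para_factor`) are all assumptions for illustration, and it presumes you already have positioned strings with known widths from some PDF parser.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Fragment:
    """One positioned string from the page (hypothetical structure)."""
    x: float      # left edge, in points
    y: float      # baseline, measured downward from the top of the page
    width: float  # rendered width, in points
    text: str


def group_lines(fragments, y_tol=2.0):
    """Group fragments whose baselines are within y_tol of each other."""
    lines = []
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        if lines and abs(frag.y - lines[-1][0].y) <= y_tol:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Re-sort each line left to right, since baselines may differ slightly.
    return [sorted(line, key=lambda f: f.x) for line in lines]


def line_text(line, space_gap=1.0):
    """Join fragments, inserting a space only where there is a visible gap."""
    parts = [line[0].text]
    for prev, cur in zip(line, line[1:]):
        gap = cur.x - (prev.x + prev.width)
        parts.append((" " if gap > space_gap else "") + cur.text)
    return "".join(parts)


def reflow(fragments, para_factor=1.5):
    """Return paragraphs; a baseline gap well above the median starts a new one."""
    lines = group_lines(fragments)
    if not lines:
        return []
    gaps = [b[0].y - a[0].y for a, b in zip(lines, lines[1:])]
    typical = median(gaps) if gaps else 0.0
    paragraphs = [[line_text(lines[0])]]
    for gap, line in zip(gaps, lines[1:]):
        if gap > para_factor * typical:
            paragraphs.append([])
        paragraphs[-1].append(line_text(line))
    return [" ".join(p) for p in paragraphs]
```

Real documents break this quickly (multiple columns, indented first lines with no extra spacing, hyphenation, footnotes), which is why it ends up looking like OCR post-processing.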

The strings are also not ASCII or Unicode by default, but rather lists of font glyphs that might or might not have metadata associated with them that maps them back to Unicode code points. Accents and diacritics can be rendered as individual strokes as well.
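A rough sketch of what recovering text from glyphs looks like when the mapping metadata is present (in PDF this is typically a font's ToUnicode CMap). The table contents and the `decode_glyphs` helper are made up for illustration, and the sketch assumes a separately drawn accent follows its base letter and maps to a combining character, which is not guaranteed in practice.

```python
import unicodedata

# Hypothetical per-font table standing in for glyph-to-Unicode metadata.
# Glyph 5 is assumed to be an accent drawn as its own glyph (combining acute).
TO_UNICODE = {1: "c", 2: "a", 3: "f", 4: "e", 5: "\u0301"}


def decode_glyphs(glyph_ids, to_unicode):
    """Map glyph IDs to text; unmapped glyphs become U+FFFD."""
    chars = [to_unicode.get(g, "\ufffd") for g in glyph_ids]
    # NFC composes a base letter followed by a combining mark into one character.
    return unicodedata.normalize("NFC", "".join(chars))
```

When the table is missing or incomplete, you are left with replacement characters, and the only remaining option is to recognize the glyph shapes themselves.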

Effectively, you’ll always need some level of OCR-like processing.

Perhaps so.

Should I be angry with Adobe about this?

(I'm still kind of leery about the move from things like groff as a quasi-presentation format, but I may come from simpler times.)