Hacker News
by jdmichal 3334 days ago
Absolutely right. PDF fixes the problem Word documents have, where different versions of Word tend to render the document ever so slightly differently, usually in a way that seems to mess up all those beautiful page breaks you meticulously planned. It does this by giving every element absolute positioning.
3 comments

Most people don't know you can open a PDF in Word 2016 and edit it, just like a .doc/.docx file...

Of course, reflowing the document is really painful with that source.

And it is the scourge of every paper I want to read on my Kindle. Everyone publishes their PDFs and there seems to be no reliable way to reflow them.
Which is why ePub and other digital book formats exist. I'm glad there is a format that prioritizes WYSIWYG over reflow (though the part of the spec where they introduced scripting is a bit dodgy).
The "Reading Mode" in Adobe's Acrobat app for Android does a pretty good job of reflowing most of the PDFs I've thrown at it.
Should be able to easily reflow text as long as it's using newline operators (T*, ', "). Might still need some basic heuristics for paragraph breaks. But much better than the alternative of attempting to correlate individual lines of text together based on positioning.
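To make that concrete, here is a rough sketch of the idea, not a real PDF parser. It assumes an already-decompressed content stream that only uses literal strings with Tj, T*, ' and " (no TJ arrays, hex strings, nested parentheses or font encodings). The names text_lines and reflow and the 75% "short line" rule are made up for illustration.

```python
import re

# A literal string "(...)" with backslash escapes (no nested parentheses),
# or any other whitespace-delimited token.
TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|[^\s()]+")

def text_lines(stream):
    """Collect shown text, starting a new line at each T*, ' or " operator."""
    lines = [""]
    last_string = None
    for tok in TOKEN.findall(stream):
        if tok.startswith("("):
            # String operand: drop the parentheses, undo \( \) \\ style escapes.
            last_string = re.sub(r"\\(.)", r"\1", tok[1:-1])
            continue
        if tok in ("T*", "'", '"'):
            lines.append("")              # all three move to the next line
        if tok in ("Tj", "'", '"') and last_string is not None:
            lines[-1] += last_string      # these three also show a string
        last_string = None                # numbers and other operators are ignored
    return lines

def reflow(lines, short=0.75):
    """Join lines into paragraphs; a blank or noticeably short line ends one."""
    width = max((len(ln) for ln in lines), default=0)
    paragraphs, current = [], []
    for ln in lines:
        if ln.strip():
            current.append(ln.strip())
        if current and len(ln) < short * width:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs

sample = """
BT /F1 12 Tf 14 TL 72 720 Td
(PDF places every line of text) Tj
(at a fixed spot on the page.) '
(Short end.) '
(A second paragraph starts here) '
(and runs on to another line.) '
ET
"""

for paragraph in reflow(text_lines(sample)):
    print(paragraph)
```

The short-line rule is the "basic heuristics" part: it guesses that a line well short of the widest one is the last line of a paragraph, and it will guess wrong on headings, lists and ragged text.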
I use Calibre for my e-books, and it seems to do a pretty good job of reflowing PDFs (and exporting to mobi). YMMV though, especially if headers/footers are badly done.
Off-topic, but you seem like you might know: why does text copied from PDFs sometimes have messed-up spaces? It seems to guess where the spaces should go based on kerning, so with justified text, a widely-spaced line may come out with a space between each letter, while a narrowly-spaced one has no spaces at all.

(Also the thing where it inserts line breaks at the end of every print line is maddening)

That's often caused by the font specified in the PDF not being available on the platform where the PDF viewer is running, so a different font has been used instead.
Hmm. I may have been unclear--the PDF reads fine, but if I copy and paste some text into a text editor, I get the messed-up spaces. It seems as if the PDF doesn't encode text as text but just as a series of characters and locations, leaving spaces unrecorded, so when copy-pasting the reader has to guess from the distance between letters.
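A toy illustration of that guessing game, assuming the reader only gets characters with x positions and widths. The Glyph class, the layout helper and all the numbers are invented for the example; no real PDF reader is being quoted here. A fixed gap threshold reproduces both failure modes described above, while a per-line threshold copes better.

```python
from dataclasses import dataclass

@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph
    width: float  # advance width of the glyph

def layout(text, letter_gap, word_gap, width=5.0):
    """Place the non-space characters of `text`; spaces are never recorded."""
    glyphs, x = [], 0.0
    for word in text.split():
        for ch in word:
            glyphs.append(Glyph(ch, x, width))
            x += width + letter_gap
        x += word_gap - letter_gap  # widen the gap that follows a word
    return glyphs

def gaps(glyphs):
    """Horizontal distance between each glyph and the next one."""
    return [b.x - (a.x + a.width) for a, b in zip(glyphs, glyphs[1:])]

def join(glyphs, threshold):
    """Rebuild text, guessing a space wherever the gap exceeds `threshold`."""
    out = [glyphs[0].char]
    for glyph, gap in zip(glyphs[1:], gaps(glyphs)):
        if gap > threshold:
            out.append(" ")
        out.append(glyph.char)
    return "".join(out)

def adaptive_threshold(glyphs):
    """Per-line guess: halfway between the smallest and largest gap."""
    g = gaps(glyphs)
    return (min(g) + max(g)) / 2

# The same words, set loosely and tightly as justification might do.
wide = layout("the cat sat", letter_gap=2.0, word_gap=6.0)
tight = layout("the cat sat", letter_gap=0.2, word_gap=1.0)

print(join(wide, 1.5))    # fixed threshold: a space between every letter
print(join(tight, 1.5))   # fixed threshold: no spaces at all
print(join(wide, adaptive_threshold(wide)))
print(join(tight, adaptive_threshold(tight)))
```

Even the per-line version is only a heuristic: on a line holding a single word with slightly uneven letter gaps, it would still insert spurious spaces.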