by m12k 1556 days ago
I recently discovered that search-replacing text in a PDF without changing the layout is much harder than I thought it would be (a customer forgot to change their billing address, and now that the invoice is finalized, Stripe won't let me edit anything, so down the PDF-editing rabbit hole I went). I would love it if I could just use an API for this.
5 comments

There are so many ways to lay out text on a PDF page that this is nearly impossible to implement for all scenarios. I don't know of a PDF editor that works in all cases.

Sometimes text is positioned absolutely, relative to the page border; sometimes it is positioned relative to other elements, so that moving a word shifts all following elements around. There can be multiple matrices involved in positioning text elements. Sometimes text elements are all positioned independently, sometimes by using newlines with a custom size. Text elements can span multiple lines or words, but sometimes each letter is a single text element, and then it is hard even to determine which letters go together or whether there's meant to be a space. Additionally, fonts can be subsetted, in which case it's impossible to use letters missing from the subset without having the original font. And then there can be OCR'ed PDFs, where an image of scanned text is overlaid on top of the real text. Oh, and there can be clipping paths: rectangles which erase all text below.

And each PDF producer creates a different PDF structure.

For reading, PDFs are awesome. For editing, PDFs are a nightmare.
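
To make the point about split text elements concrete, here is a minimal Python sketch. The content-stream fragment is hypothetical (not taken from any real invoice), and the parsing is deliberately naive: it assumes simple literal strings in a `TJ` array and ignores hex strings, escape decoding, and font encodings, all of which a real tool would have to handle.

```python
import re

# Hypothetical content-stream fragment: one visible phrase, "Billing Address",
# emitted as a TJ array whose string pieces are separated by kerning numbers.
FRAGMENT = "BT /F1 10 Tf 72 700 Td [(Bil) -20 (ling) 15 ( Addr) -10 (ess)] TJ ET"

# One literal string "( ... )", allowing backslash escapes inside it.
STRING_PIECE = re.compile(r"\((?:\\.|[^\\()])*\)")


def visible_text(fragment: str) -> str:
    """Join the literal-string pieces of every TJ array in the fragment."""
    out = []
    for array in re.findall(r"\[(.*?)\]\s*TJ", fragment, flags=re.S):
        for piece in STRING_PIECE.findall(array):
            out.append(piece[1:-1])  # drop the surrounding parentheses
    return "".join(out)


# A plain substring search on the raw stream misses the phrase...
found_raw = "Billing Address" in FRAGMENT
# ...while joining the pieces recovers it.
found_joined = "Billing Address" in visible_text(FRAGMENT)
```

Even once the phrase is found, replacing it means deciding how to redistribute the new characters and kerning values across the pieces, which is where the layout tends to break.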

If it's just a one-off, I'd draw a white rectangle over the text that needs to be changed, then add the text on top of that.
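
A rough sketch of what that amounts to at the content-stream level, in Python. The function name `cover_and_retype`, the font resource name `F1`, and the 2-point text inset are all assumptions for illustration; the font must already exist in the page's resources and contain the needed glyphs, which is exactly where subsetted fonts cause trouble.

```python
def cover_and_retype(x, y, width, height, text, font="F1", size=10):
    """Return PDF operators that paint a white box and draw text over it."""
    # Escape the characters that are special inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    ops = [
        "q",                             # save graphics state
        "1 1 1 rg",                      # white fill colour (RGB)
        f"{x} {y} {width} {height} re",  # rectangle path over the old text
        "f",                             # fill it, hiding what is beneath
        "0 0 0 rg",                      # black fill colour for the new text
        "BT",                            # begin text object
        f"/{font} {size} Tf",            # select font and size
        f"{x + 2} {y + 2} Td",           # move to a small inset inside the box
        f"({escaped}) Tj",               # show the replacement text
        "ET",                            # end text object
        "Q",                             # restore graphics state
    ]
    return "\n".join(ops)


ops = cover_and_retype(72, 700, 200, 14, "1 Main St (Suite 2)").splitlines()
```

These operators would be appended to the end of the page's content so they paint last. Note that the old text is only hidden, not removed: it is still in the file and can still be extracted or searched.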
This isn't easy because PDF inherits PostScript's imaging model, so text is laid out absolutely. You can make very small changes, but a larger change requiring a reflow of the text would break things. In some cases it is possible to convert the PDF to a Word document, make edits, and then save it back to a PDF.
You only need Acrobat Pro for that.* That's daily business for me, although not with invoices but printing data.

* (becomes harder when the font is not embedded/existent as a subset, but Acrobat lets you choose another font, so no big deal.)

LibreOffice Draw edits PDFs pretty well.
What I used for this exact problem was pdftk's `stamp` option, with a stamp PDF that was just a white rectangle with text on it, as a sibling commenter mentioned. Worked for several hundred documents!
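
For anyone wanting to batch this, a minimal Python sketch around the pdftk command line. The helper names `stamp_command` and `stamp_all` and the directory layout are assumptions; the only pdftk syntax relied on is `pdftk <input> stamp <stamp.pdf> output <result>`, and pdftk must be installed and on the PATH for `stamp_all` to work.

```python
import subprocess
from pathlib import Path


def stamp_command(src, stamp, dst):
    """Build the pdftk invocation that overlays `stamp` on each page of `src`."""
    return ["pdftk", str(src), "stamp", str(stamp), "output", str(dst)]


def stamp_all(in_dir, stamp, out_dir):
    """Stamp every PDF in `in_dir`, writing results to `out_dir` (hypothetical layout)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(in_dir).glob("*.pdf")):
        # check=True raises CalledProcessError if pdftk exits with an error.
        subprocess.run(stamp_command(src, stamp, out_dir / src.name), check=True)


cmd = stamp_command("invoice.pdf", "fix.pdf", "invoice-fixed.pdf")
```

`stamp` applies the first page of the stamp PDF to every page of the input, so for multi-page documents where only one page needs the patch, the stamp would have to be applied to that page separately, or `multistamp` used with a stamp PDF that has blank pages elsewhere.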