Hacker News
by ZoomZoomZoom 653 days ago
I recently wanted to edit out a huge background image repeating on almost every page of a PDF and found out there's no obvious way to do it.

Would appreciate any tool suggestions!

4 comments

This is probably a simple find-and-replace task, so I wouldn't bother with proper PDF parsing or libraries. I would:

1. Use pdftk to uncompress it: pdftk input.pdf output uncompressed.pdf uncompress

2. Look at the PDF code (it's text based) to find the image insertion code.

3. Replace all instances of the image insertion code with strings of spaces of the same length (the cross-reference table at the end of the file stores object byte offsets, and you don't want to shift those).

4. Use pdftk to compress it again: pdftk edited.pdf output output.pdf compress

I have a script that does this to remove pen strokes of particular colours so I can e.g. strip out marking rubric on test solutions written on a tablet.
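
Not the script mentioned above, just a minimal Python sketch of step 3 under some assumptions: the file has already been uncompressed with pdftk, and the background is drawn through a named image such as /Im0 (a made-up name; use whatever you find in step 2). The function names blank_image_draws and blank_file are hypothetical too. Each match is overwritten with the same number of spaces, so byte offsets and stream lengths should stay valid.

```python
import re


def blank_image_draws(pdf_bytes: bytes, name: bytes) -> bytes:
    """Overwrite every '/<name> Do' with spaces of identical length.

    `name` is the resource name without the leading slash, e.g. b"Im0"
    (an assumed example name, not something every PDF uses).
    """
    # Match the name, the whitespace after it, and the Do operator.
    pattern = re.compile(rb"/" + re.escape(name) + rb"\s+Do\b")
    # Same-length replacement keeps the file's byte offsets unchanged.
    return pattern.sub(lambda m: b" " * len(m.group()), pdf_bytes)


def blank_file(src: str, dst: str, name: bytes) -> None:
    """Read src as raw bytes, blank the image draws, write the result to dst."""
    with open(src, "rb") as f:
        data = f.read()
    with open(dst, "wb") as f:
        f.write(blank_image_draws(data, name))
```

Something like blank_file("uncompressed.pdf", "edited.pdf", b"Im0") would then produce the input for step 4. The same resource name can refer to different images on different pages, so it is worth checking a few pages before blanking it everywhere.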

Get the PDF 1.7 spec from https://pdfa.org/resource/pdf-specification-archive/. You're looking for the "Do" operator invoking a named image object defined elsewhere with "/Subtype /Image". See section 4.8, particularly the example on p. 343 (these numbers follow Adobe's PDF Reference 1.7; the ISO edition is numbered differently). Or, if it's badly done, it might instead be an inline image using the "BI" operator (a bit later in the same section).
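
To find which name to target, a rough sketch like the following can tally the "Do" invocations in the uncompressed file. It assumes plain regex scanning is good enough for the file at hand; the function names are made up for illustration. A name drawn about once per page is a likely background candidate, but "Do" also invokes non-image XObjects, so confirm that the object it points to has "/Subtype /Image".

```python
import re
from collections import Counter

# A PDF name (slash plus non-delimiter characters), whitespace, then Do.
DO_RE = re.compile(rb"/([^\s/\[\]<>(){}%]+)\s+Do\b")


def count_xobject_draws(pdf_bytes: bytes) -> Counter:
    """Count how often each named XObject is drawn with the Do operator."""
    return Counter(m.group(1).decode("latin-1") for m in DO_RE.finditer(pdf_bytes))


def count_image_objects(pdf_bytes: bytes) -> int:
    """Count dictionaries that declare /Subtype /Image."""
    return len(re.findall(rb"/Subtype\s*/Image\b", pdf_bytes))
```

Calling count_xobject_draws(open("uncompressed.pdf", "rb").read()).most_common(5) would list the most frequently drawn names. This does not detect inline images ("BI"), which would need a separate search.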

I've had good experience with pypdf, if you're willing to do a little coding.

If you're OK doing it manually (not scripted), Inkscape can do this.

You could try one of Adobe's PDF APIs or script their software locally.