Command-line tools for this have been around for many years [0]. It's easy on plain-text (ASCII) files, but it takes a bit more work on HTML or binary files. For those you would probably extract the textual content first (e.g. by stripping HTML tags) and then compare the plain text. Alternatively you could render the HTML/PDF file and do a visual comparison, then extract the changed text from the images.
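To make the "strip the tags, then compare" idea concrete, here is a rough sketch using only the Python standard library (html.parser and difflib). The names TextExtractor, html_to_lines and diff_html_text are made up for this example, and real-world HTML would likely need more careful handling (whitespace, entities, hidden elements), so treat it as an illustration rather than a finished tool.

```python
import difflib
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script>/<style> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_lines(html):
    """Return the visible text of an HTML string as a list of text chunks."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


def diff_html_text(old_html, new_html):
    """Unified diff of the extracted text, ignoring markup-only changes."""
    return list(
        difflib.unified_diff(
            html_to_lines(old_html),
            html_to_lines(new_html),
            fromfile="old",
            tofile="new",
            lineterm="",
        )
    )


if __name__ == "__main__":
    old = "<p>Hello <b>world</b></p><p>Bye</p>"
    new = "<div><p>Hello <i>world</i></p><p>Goodbye</p></div>"
    print("\n".join(diff_html_text(old, new)))
```

Because the comparison runs on the extracted text, a change that only touches markup (say, b replaced by i) does not show up in the diff at all.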

By default, diff programs produce line-based output, but you can switch to finer-grained per-word highlighting via options (e.g. 'git diff --color-words').
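For illustration, a word-level diff can be sketched in a few lines with Python's difflib.SequenceMatcher. This is not how git implements it; the function name word_diff is hypothetical, and the [-removed-] / {+added+} markers are just loosely modeled on git's plain word-diff notation.

```python
import difflib


def word_diff(old, new):
    """Return a one-line, word-level diff of two strings.

    Removed words are wrapped in [-...-], added words in {+...+}.
    Whitespace is normalized because both inputs are split on whitespace.
    """
    a, b = old.split(), new.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
            continue
        # "replace" carries both sides; "delete"/"insert" carry only one.
        if i2 > i1:
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if j2 > j1:
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)


if __name__ == "__main__":
    print(word_diff("the quick brown fox", "the slow brown fox"))
```

Splitting on whitespace is the simplest possible tokenizer; punctuation stays attached to words, which may or may not be what you want.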

The thing with PDF is that even when you re-save the same PDF file in the same editor, you will often get an entirely different file. I'm not a PDF expert, but from what I've learned, PDF is the type of file that stores a kind of vector representation of glyphs and their placement and is often unaware of what each glyph represents (this perhaps depends on the program used to create the PDF and its options). Importing a PDF back into e.g. OpenOffice is ugly work for the plugins.

There are some existing solutions for diffing PDFs [1]; however, I haven't really played with them.

[0] http://en.wikipedia.org/wiki/Diff
[1] http://stackoverflow.com/questions/887186/java-pdf-diff-libr...