Hacker News
by lcampbell 2348 days ago
I've always thought of PDF as an opaque format. How do you search and/or browse your collection? Does the subject show up in grep [e: without being diligent with filenames]?
2 comments

I (*nix user) use a script that basically does:

    pdftotext -layout -eol unix -nopgbrk "$PDF" - | grep -E ...

Many PDFs have compressed content streams, so plain-text utilities such as grep see only the metadata in those files. The extracted text, cached and compressed, is usually tiny and can be searched with zgrep.
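
A minimal sketch of the caching idea, not the actual script: extract each PDF once, store the gzip-compressed text, and search the cache afterwards. The cache location, the `PDF_TEXT_CACHE` variable, and the function names `cache_path`, `update_cache` and `search_cache` are all made up for illustration.

```shell
#!/bin/sh
# Hypothetical cache location; override with PDF_TEXT_CACHE.
CACHE_DIR="${PDF_TEXT_CACHE:-$HOME/.cache/pdftext}"

cache_path() {
    # Map /path/to/doc.pdf to $CACHE_DIR/path/to/doc.pdf.txt.gz
    printf '%s/%s.txt.gz\n' "$CACHE_DIR" "${1#/}"
}

update_cache() {
    pdf=$1
    out=$(cache_path "$pdf")
    # Re-extract only if the cache entry is missing or older than the PDF.
    if [ ! -e "$out" ] || [ "$pdf" -nt "$out" ]; then
        mkdir -p "$(dirname "$out")"
        # "-" sends the extracted text to stdout instead of a .txt file.
        pdftotext -layout -eol unix -nopgbrk "$pdf" - | gzip -c > "$out"
    fi
}

search_cache() {
    # List cache entries whose text matches the pattern (case-insensitive).
    find "$CACHE_DIR" -type f -name '*.txt.gz' -exec zgrep -l -i -e "$1" {} +
}
```

Something like `find ~/papers -name '*.pdf' | while IFS= read -r f; do update_cache "$f"; done` would fill the cache (`~/papers` being a placeholder), after which `search_cache 'some pattern'` prints the matching cache entries. Each one maps back to its PDF by stripping the cache prefix and the `.txt.gz` suffix.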

pdfinfo shows document metadata (title, subject, keywords and more), but these fields are rarely filled in usefully; PDFs produced by Adobe tools or LᴬTᴇX are the ones that tend to have this data.
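
To check how much of a collection actually carries usable metadata, something along these lines could work. It is a rough sketch that assumes pdfinfo's usual `Field:   value` output; the function names `meta_fields` and `pdf_meta` are made up.

```shell
#!/bin/sh
meta_fields() {
    # Keep only the Title, Subject and Keywords lines of pdfinfo-style output.
    grep -E '^(Title|Subject|Keywords):'
}

pdf_meta() {
    # Prefix each kept field with the file name so output from many PDFs
    # stays attributable when grepped later.
    pdfinfo "$1" 2>/dev/null | meta_fields | while IFS= read -r line; do
        printf '%s: %s\n' "$1" "$line"
    done
}
```

Running `pdf_meta` over every file and grepping the combined output gives a quick subject search for those PDFs that do have these fields set.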

Both pdftotext and pdfinfo come with xpdf.

This is great; thanks for sharing!
I'd be interested in knowing this too. It sounds like a good idea.