by slashdot2008 1808 days ago
It would be nice if it were a PDF I could download and save for later.

Is there any way to turn a series of pages into a PDF? Like a recursive wget and then pipe through pandoc?

5 comments

Lots of ways to do this, but one way is to install poppler-utils so you get pdfunite, make sure the filenames of your pages sort lexicographically in the order you want the pages to end up[1], then run

    pdfunite page*.pdf output.pdf
I have also had decent results using pdftk for PDF surgery, so that's another option.
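
For reference, the pdftk equivalent of the pdfunite line is roughly the following. This is only a sketch: merge_with_pdftk is a made-up wrapper name and PDFTK is a hypothetical override variable (handy if the binary is installed under another name); the `cat output` syntax itself is pdftk's own.

```shell
# Merge PDFs with pdftk; inputs are concatenated in the order given.
# merge_with_pdftk and PDFTK are illustrative names, not part of pdftk.
merge_with_pdftk() {
    out=$1
    shift
    # pdftk syntax: pdftk <inputs...> cat output <outfile>
    "${PDFTK:-pdftk}" "$@" cat output "$out"
}
```

Usage would be something like

    merge_with_pdftk output.pdf page*.pdf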

In this case, if you do a recursive wget I think it should "just work" because the files are named in a friendly way.

So, putting it all together:

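    # Recursively fetch the site, starting from the overview page
    wget -r 'https://dropbox.github.io/dbx-career-framework/overview.html'
    cd dropbox.github.io/dbx-career-framework
    # Convert each page to PDF (quoted so odd filenames don't break the loop)
    for f in ic*software*.html; do
        pandoc --pdf-engine=wkhtmltopdf "$f" -o "${f%.html}.pdf"
    done
    # Merge in glob (lexicographic) order
    pdfunite ic*.pdf output.pdf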

[1] i.e., the order of the output of "ls" is the order the pages will have in the output PDF.
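
If your pages are numbered without leading zeros (page1.pdf, page2.pdf, page10.pdf), the glob puts page10 before page2. Here is a minimal sketch of a fix, assuming names of the form pageN.pdf; pad_page_names is a made-up helper, not something shipped with poppler-utils:

```shell
# Hypothetical helper: zero-pad the number in names like page7.pdf so that
# glob order (page*.pdf) matches numeric page order.
pad_page_names() {
    dir=${1:-.}
    for f in "$dir"/page*.pdf; do
        [ -e "$f" ] || continue            # glob matched nothing
        base=${f##*/}                      # e.g. page7.pdf
        num=${base#page}                   # e.g. 7.pdf
        num=${num%.pdf}                    # e.g. 7
        case $num in
            ''|*[!0-9]*) continue ;;       # skip names without a plain number
        esac
        # Strip leading zeros so printf does not read the number as octal
        while [ "${#num}" -gt 1 ] && [ "${num#0}" != "$num" ]; do
            num=${num#0}
        done
        new=$(printf 'page%03d.pdf' "$num")
        [ "$new" = "$base" ] && continue   # already padded
        [ -e "$dir/$new" ] && continue     # never overwrite an existing file
        mv -- "$f" "$dir/$new"
    done
}
```

Run it on the directory before merging, e.g.

    pad_page_names . && pdfunite page*.pdf output.pdf

Three digits is an arbitrary choice that covers up to 999 pages.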
A bit more manual, but I've been saving webpages I like in Obsidian.

First, click the reader view in Firefox, then select all and paste it into a new Obsidian page. It's really good at preserving the formatting and importing pictures. You can then export the result to PDF if desired.

Not sure why you are downvoted. I save everything I want to refer to again as PDF because stuff on the web disappears. I can search all my PDFs offline with Qiqqa or Mendeley. I used to use Google Desktop for PDF search.
Check out https://archivebox.io/ for a great self-hosted solution. It's one of the best such programs I've found.

You can hack together some scripts to do the basics yourself, but archiving arbitrary pages is pretty difficult to get right.

Print to PDF. Most browsers support it natively if your OS doesn't already.
Unless I'm mistaken, this doesn't print pages recursively.
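
One way around that, as a rough sketch: feed a list of URLs to a headless browser and let it print each one. This assumes a Chromium-based browser that supports `--headless --print-to-pdf`; print_pages and CHROME_BIN are made-up names, and the binary may be called chromium, chromium-browser, google-chrome, etc. on your system.

```shell
# Sketch: print each URL read from stdin to its own numbered PDF, so that a
# glob like page*.pdf later sorts in input order.
# CHROME_BIN is a hypothetical override for the browser binary name.
print_pages() {
    outdir=${1:-.}
    n=0
    while IFS= read -r url; do
        [ -n "$url" ] || continue          # skip blank lines
        n=$((n + 1))
        pdf=$(printf '%s/page%03d.pdf' "$outdir" "$n")
        # </dev/null keeps the browser from eating the rest of the URL list
        "${CHROME_BIN:-chromium}" --headless --print-to-pdf="$pdf" "$url" </dev/null
    done
}
```

You would still need to build the URL list yourself (by hand, or from a recursive wget), then something like

    print_pages . < urls.txt && pdfunite page*.pdf output.pdf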