Hacker News
by bravura 4788 days ago
Is there an open-source library for archiving a URL, including all assets (JS, graphics, etc.)?

This task is trickier than it initially seems.

I'd love to have a local cache of bookmarked URLs.

4 comments

Wget can do it, with variable success.

I thought about building one in PhantomJS, since it seems it can extract all the required assets, but gave up on it since most of the content I care about is in my RSS archive anyway.

Wget is good if you have the right options (http://psung.blogspot.com/2008/06/using-wget-or-curl-to-down...).
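For what it's worth, here is roughly what "the right options" might look like for a single page, wrapped in Python so it can be run over a list of bookmarks. This is only a sketch: the flag selection is my own guess at a reasonable set (it is not taken from the linked post), and `build_wget_command` and `archive` are hypothetical helper names.

```python
import shutil
import subprocess


def build_wget_command(url, dest_dir="archive"):
    """Return the argument list for a single-page wget archive run.

    The flag selection is an assumption, not the linked post's recipe.
    """
    return [
        "wget",
        "--page-requisites",    # also fetch images, CSS, and scripts the page references
        "--convert-links",      # rewrite links so the saved copy works offline
        "--adjust-extension",   # add .html/.css suffixes where the server omits them
        "--span-hosts",         # allow assets hosted on other domains (e.g. CDNs)
        "--directory-prefix=" + dest_dir,  # save everything under dest_dir
        url,
    ]


def archive(url, dest_dir="archive"):
    """Run wget if it is installed and return its exit code."""
    if shutil.which("wget") is None:
        raise RuntimeError("wget not found on PATH")
    return subprocess.run(build_wget_command(url, dest_dir)).returncode
```

Calling `archive("http://example.com/")` would then save the page and its assets under `archive/`. Pages that build their content with JavaScript will still come out incomplete, which is probably where the "variable success" comes from.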

For some audio/video sites there is youtube-dl; the list of sites it handles is here: http://rg3.github.io/youtube-dl/documentation.html .

EDIT: there is a writeup on this, partially based on wget, here: http://www.gwern.net/Archiving%20URLs#local-caching (it also covers extracting URLs from Firefox history)

EDIT2: and if you're really desperate you could always use tcpdump/mitmdump and something like this: http://justniffer.sourceforge.net/#!/justniffer_grab_http_tr...

What about browser -> save as -> save as type "Website, complete"? Not really a library, and not something you could do retroactively to URLs already bookmarked without revisiting the pages, but it sort of works.

Regarding a library for archiving a URL - it'd be interesting if there were a way to save a URL by inlining the CSS and JavaScript in the HTML, and converting images and other assets to data URLs.
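A minimal sketch of the data URL part of that idea, assuming the image bytes have already been downloaded and are passed in as a dict. `to_data_url` and `inline_images` are hypothetical names, and the regex only handles double-quoted `src` attributes on `<img>` tags; a real tool would want a proper HTML parser and would also need to handle CSS, scripts, and fonts.

```python
import base64
import mimetypes
import re


def to_data_url(data, mime_type):
    """Encode raw bytes as a base64 data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def inline_images(html, assets):
    """Replace <img src="..."> values found in `assets` with data URLs.

    `assets` maps a src string to the raw bytes already fetched for it.
    Images whose src is not in `assets` are left untouched.
    """
    def replace(match):
        src = match.group(2)
        if src not in assets:
            return match.group(0)
        # Guess the MIME type from the file extension; fall back to a generic type.
        mime_type = mimetypes.guess_type(src)[0] or "application/octet-stream"
        return match.group(1) + to_data_url(assets[src], mime_type) + match.group(3)

    return re.sub(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', replace, html)
```

The result is a single HTML file with no external image requests, at the cost of a larger file, since base64 adds roughly a third to the size of each asset.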

If you don't mind free but closed source, Evernote.
HTTrack (http://www.httrack.com/), maybe...