HN Mirror

by bravura 4788 days ago

Is there an open-source library for archiving a URL, including all assets (JS, graphics, etc.)?

This task is trickier than it initially seems.

I'd love to have a local cache of bookmarked URLs.

4 comments

icebraining 4788 days ago

Wget can do it, with variable success.

I thought about building one in PhantomJS, since it seems it can extract all the required assets, but gave up on it since most of the content I care about is in my RSS archive anyway.

notaddicted 4788 days ago

Wget is good if you have the right options (http://psung.blogspot.com/2008/06/using-wget-or-curl-to-down...).
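As a rough illustration of what "the right options" might look like, here is a minimal Python sketch that builds a wget command line for saving one page plus its assets. The flags are standard GNU Wget options, but this particular selection is an assumption, not necessarily the set recommended in the linked post; the helper names `wget_archive_command` and `archive` and the `dest_dir` parameter are hypothetical.

```python
import subprocess
from typing import List


def wget_archive_command(url: str, dest_dir: str) -> List[str]:
    """Build a wget argv that saves a page and the assets it needs.

    The flag selection is an assumption for illustration only.
    """
    return [
        "wget",
        "--page-requisites",   # also fetch images, CSS and JS the page needs
        "--convert-links",     # rewrite links to point at the local copies
        "--adjust-extension",  # add .html/.css suffixes where they are missing
        "--span-hosts",        # allow assets served from other hosts
        "--directory-prefix=" + dest_dir,  # where to store the files
        url,
    ]


def archive(url: str, dest_dir: str) -> int:
    """Run wget (must be installed) and return its exit code."""
    return subprocess.run(wget_archive_command(url, dest_dir)).returncode
```

Calling `archive("http://example.com/", "cache")` would try to store the page under `cache/`. Pages that build their content with JavaScript may still come out incomplete, which fits the "variable success" noted above.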

For some audio/video sites there is youtube-dl; the list of sites it handles is here: http://rg3.github.io/youtube-dl/documentation.html .

EDIT: there is a writeup that is partially based on wget here: http://www.gwern.net/Archiving%20URLs#local-caching (it also covers extracting URLs from Firefox history)

EDIT2: and if you're really desperate you could always use tcpdump/mitmdump and something like this: http://justniffer.sourceforge.net/#!/justniffer_grab_http_tr...

redidas 4788 days ago

What about browser -> save as -> save as type "Website, complete"? Not really a library, and not something you could do retroactively for URLs already bookmarked without revisiting the pages, but it sort of works.

Regarding a library for archiving a URL - it'd be interesting if there were a way to save a URL by inlining the CSS and JavaScript in the HTML, and converting images and other assets to data URLs.
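To make the data-URL idea concrete, here is a minimal sketch in Python using only the standard library. It handles just `<img src=...>` attributes with a regular expression, which is an assumption made for brevity; a real tool would want an HTML parser and would also have to cover CSS, scripts and fonts. The names `to_data_url`, `inline_images` and the `fetch` callback are hypothetical.

```python
import base64
import re
from typing import Callable, Tuple


def to_data_url(data: bytes, mime: str) -> str:
    """Encode raw bytes as a base64 data: URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# Matches src="..." or src='...' inside an <img> tag (sketch only).
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=)(["\'])(.*?)\2', re.IGNORECASE)


def inline_images(html: str, fetch: Callable[[str], Tuple[bytes, str]]) -> str:
    """Replace each <img> src with a data: URL.

    `fetch` is supplied by the caller and returns (bytes, mime_type)
    for a given URL, so this function does no network access itself.
    """
    def replace(match: "re.Match[str]") -> str:
        prefix, quote, url = match.groups()
        if url.startswith("data:"):
            return match.group(0)  # already inlined, leave untouched
        data, mime = fetch(url)
        return f"{prefix}{quote}{to_data_url(data, mime)}{quote}"

    return IMG_SRC.sub(replace, html)
```

For example, `inline_images('<img src="a.png">', fetch)` returns the same tag with `src` replaced by a `data:image/png;base64,...` string, so the saved HTML file no longer depends on the external image.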

pbreit 4788 days ago

If you don't mind free but closed source, Evernote.

smacktoward 4788 days ago

HTTrack (http://www.httrack.com/), maybe...
