by JackC 3274 days ago
For personal web archiving, I highly recommend http://webrecorder.io. The site lets you download archives in standard WARC format and play them back in an offline (Electron) player. It's also open source and has a quick local setup via Docker - https://github.com/webrecorder/webrecorder .

Webrecorder is by a former Internet Archive engineer, Ilya Kreymer, who now captures online performance art for an art museum. What he's doing with capture and playback of JavaScript, web video, streaming content, etc. is state of the art as far as I know.

(Disclaimer - I use bits of Webrecorder for my own archive, perma.cc.)

For OP, I would say consider building on and contributing back to Webrecorder -- or alternatively figure out what Webrecorder is good at and make sure you're good at something different. It's a crazy hard problem to do well and it's great to have more ideas in the mix.


Seconding Webrecorder (and the newly updated WAIL) - I had the chance to meet Ilya Kreymer at a conference a few weeks ago, and I can confirm what he's doing is top notch - I'm hoping to see more work around WARC viewing and sharing in the future.

(Disclaimer: I also do personal archiving stuff with getkumbu)

Hi motdiem,

Thank you for seconding the newly updated WAIL. I am the maintainer/creator of the newly updated WAIL (the Electron version): https://github.com/N0taN3rd/wail

I was unable to attend the IIPC Web Archiving Conference (WAC), but the original creator of WAIL (Python), Mat Kelly, did attend (we are both a part of the same research group, WSDL).

If you or anyone else have any questions about WAIL I am more than happy to answer them.

Is offline playback still relevant in the age of ubiquitous always connected Internet?
If your intention is to have a local archive of an online site, yes.
In my use case, some content is only available for a short window. If I want to refer to it later, the URL will not work. This happens to me a lot on Wikipedia, where a referenced URL is no longer working (linkrot). We need a better way to track previous versions or access pages that now return 404 but were previously alive.
Definitely, sites and content become inaccessible all the time.

For instance, I back up all new videos of my favorite YouTubers in case they are taken down (e.g. in the case of a copyright claim).

Last I played with it, the latency on webrecorder was uncomfortably high for always-on recording of personal web usage (the pages only display once fully rendered). I wish webpages would render as normal and get asynchronously archived once loading is complete.

That would allow constant archival of every webpage a user ever visits -- an immutable record over the user's years of crawling the web.

> That would allow constant archival of every webpage a user ever visits -- an immutable record over the user's years of crawling the web.

This is usually solved by using a proxy: http://netpreserve.org/projects/live-archiving-http-proxy/

Can this be combined with webrecorder? Does anyone know someone who's done this? I only use about 30GB of traffic a month so a 2TB $70 hard drive would last me almost six years.
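The storage estimate above can be checked with a few lines of Python. This is only a rough sketch: it assumes 2TB means 2000GB, that usage stays at 30GB a month, and that the archive takes exactly as much space as the traffic (no compression, deduplication, or WARC overhead). The function name is made up for illustration.

```python
def years_of_capacity(disk_gb: float, monthly_gb: float) -> float:
    """Return how many years a disk lasts at a constant monthly archive rate."""
    if monthly_gb <= 0:
        raise ValueError("monthly_gb must be positive")
    months = disk_gb / monthly_gb
    return months / 12


# Figures from the comment: a 2TB drive (taken as 2000GB) and 30GB per month.
years = years_of_capacity(2000, 30)
print(f"{years:.2f} years")
```

That comes to roughly five and a half years, so "almost six" is on the generous side.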
Thanks Jack for mentioning Webrecorder! This is a project I started and it is now part of rhizome.org, a non-profit dedicated to promoting internet-based art and digital culture.

I thought I’d add a few notes here, as there are a few ways you can use Webrecorder and related tools.

First, Webrecorder supports two distinct modes:

- Native recording mode: HTTP/S traffic goes through the browser and is rewritten to point to the Webrecorder server. (This is the default.)

- Remote browser mode: Webrecorder launches a browser in a Docker container and streams the screen to your browser (using noVNC). The traffic is either recorded or replayed depending on the mode, but the operation is the same (we call this ‘symmetrical archiving’). This gives you a recording proxy without having to configure your browser or install any plugins.

You can choose this mode by clicking the dropdown to choose a browser (currently Chrome and FF). This is essentially a remote browser configured via an HTTP/S proxy, and it allows us to record things like Flash and even Java applets, and other technologies that may become obsolete.

- We also have a desktop player app, Webrecorder Player, available for download from: https://github.com/webrecorder/webrecorderplayer-electron/re...

This is an app that plays back WARC files (created by Webrecorder and elsewhere), and allows browsing any WARC file offline.

Another way to create a web archive (for developers): You can use the devtools in the browser to export HAR files, and Webrecorder and Webrecorder Player will convert them on the fly and play them back. Unfortunately, this option is mostly limited to developers, but you can actually create a fairly good archive locally using HAR export (available in Chrome and Firefox at least). The conversion is done using this tool: https://github.com/webrecorder/har2warc
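To get a feel for what a HAR export contains before converting it, here is a minimal Python sketch that only reads the HAR JSON and summarizes the captured requests. It does not use har2warc and does not write a WARC. The function name and the sample data are made up for illustration; the field names (log.entries, request.url, response.content.mimeType) come from the HAR format itself.

```python
import json
from collections import Counter


def summarize_har(har_text: str) -> dict:
    """Count the captured entries in a HAR document and group them by MIME type."""
    har = json.loads(har_text)
    entries = har["log"]["entries"]

    mime_types = Counter()
    for entry in entries:
        mime = entry["response"]["content"].get("mimeType", "unknown")
        # Drop parameters such as "; charset=utf-8" so types group together.
        mime_types[mime.split(";")[0].strip()] += 1

    return {
        "entries": len(entries),
        "urls": [entry["request"]["url"] for entry in entries],
        "mime_types": dict(mime_types),
    }


# A tiny hand-written HAR document, standing in for a real devtools export.
SAMPLE_HAR = json.dumps({
    "log": {
        "version": "1.2",
        "entries": [
            {
                "request": {"method": "GET", "url": "http://example.com/"},
                "response": {
                    "status": 200,
                    "content": {"mimeType": "text/html; charset=utf-8", "size": 1256},
                },
            },
            {
                "request": {"method": "GET", "url": "http://example.com/logo.png"},
                "response": {
                    "status": 200,
                    "content": {"mimeType": "image/png", "size": 4096},
                },
            },
        ],
    }
})

summary = summarize_har(SAMPLE_HAR)
print(summary["entries"], summary["mime_types"])
```

For a real export you would pass in the contents of the .har file saved from devtools instead of SAMPLE_HAR.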

- If you use webrecorder.io, you can register for an account or use it anonymously. If you register for an account, we provide 5GB of storage and you get a permanent URL for your archive. You can also upload existing WARCs (or HARs).

- You can also run Webrecorder on your own! The main codebase is at https://github.com/webrecorder/webrecorder and the remote browser system, a separate component first used for oldweb.today, lives at https://github.com/oldweb-today

Finally, the core replay/recording tech is actually a separate component, an advanced ‘wayback machine’ being developed in https://github.com/ikreymer/pywb
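As a rough illustration of what "rewritten to point to the Webrecorder server" means in native recording mode, here is a small Python sketch of prefix-style URL rewriting. It is not the actual rewriter: it assumes a replay URL layout of prefix/timestamp/original-URL, and the prefix, collection name, and timestamp are hypothetical placeholders.

```python
from urllib.parse import urljoin

# Hypothetical values for illustration only.
ARCHIVE_PREFIX = "http://localhost:8080/my-archive"
TIMESTAMP = "20170101000000"


def rewrite_url(link: str, page_url: str,
                prefix: str = ARCHIVE_PREFIX,
                timestamp: str = TIMESTAMP) -> str:
    """Resolve a link found on page_url and point it at the archive server."""
    # Relative links are resolved against the page they appeared on.
    absolute = urljoin(page_url, link)
    # Only http(s) URLs are rewritten; mailto:, data:, etc. are left alone.
    if not absolute.startswith(("http://", "https://")):
        return link
    return f"{prefix}/{timestamp}/{absolute}"


print(rewrite_url("../img/logo.png", "http://example.com/a/b.html"))
print(rewrite_url("mailto:someone@example.com", "http://example.com/"))
```

The point of the sketch is that every link on a page ends up routed through the archive server, so the browser needs no proxy settings. Doing this reliably for JavaScript-generated URLs is the hard part and is not covered here.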

There are a lot of different components here, and we would definitely appreciate help with any and all parts of the stack if anyone is interested! All our work is open-source and we are a non-profit, so any help is appreciated.

Wow, webrecorder seems very cool, especially since it's OSS. Is there any way to set it up to record all incoming traffic? In these days of cheap storage that'd be very cool. I know I personally only use about 30GB a month, so a $70 2TB hard drive would last me five and a half years of browsing.
If you're a coder I bet you could hack it to do that. It has an amazing containerized browser mode where you can browse in a remote browser via VNC, with the remote browser set up to use a WARC-writing proxy. So the general outline would be to run it locally in Docker; expose the proxy port used by the containerized browsers; and configure your own browsers to use the same proxy.

I'm not sure how much this would interfere with normal browsing -- it's not a typical use case.
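The last step of that outline (pointing a client at the recording proxy) might look like this from a script rather than a browser. It is a minimal sketch using only the Python standard library; the proxy address is a hypothetical placeholder for wherever the exposed proxy port ends up, and nothing here is specific to Webrecorder.

```python
import urllib.request
from typing import Optional

# Hypothetical address: wherever the recording proxy's port is exposed.
PROXY_URL = "http://localhost:8080"


def build_recording_opener(proxy_url: str = PROXY_URL) -> urllib.request.OpenerDirector:
    """Build an opener that sends http and https requests through the proxy."""
    proxy_handler = urllib.request.ProxyHandler(
        {"http": proxy_url, "https": proxy_url}
    )
    return urllib.request.build_opener(proxy_handler)


def fetch_via_proxy(url: str,
                    opener: Optional[urllib.request.OpenerDirector] = None,
                    timeout: float = 30.0) -> bytes:
    """Fetch url through the proxy so the proxy sees (and can record) the traffic."""
    opener = opener or build_recording_opener()
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

With a proxy actually listening at that address, fetch_via_proxy("http://example.com/") would route the request through it. For HTTPS, a recording proxy generally has to intercept TLS, so the client would also need to trust the proxy's certificate, which this sketch does not handle.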

This is great and helps me a ton, thanks for mentioning it here.
Thanks Jack, I hadn't heard of webrecorder before, but I'll check it out. :)