by JackC 3321 days ago

For personal web archiving, I highly recommend http://webrecorder.io. The site lets you download archives in standard WARC format and play them back in an offline (Electron) player. It's also open source and has a quick local setup via Docker - https://github.com/webrecorder/webrecorder .

Webrecorder is by a former Internet Archive engineer, Ilya Kreymer, who now captures online performance art for an art museum. What he's doing with capture and playback of Javascript, web video, streaming content, etc. is state of the art as far as I know.

(Disclaimer - I use bits of Webrecorder for my own archive, perma.cc.)

For OP, I would say consider building on and contributing back to Webrecorder -- or alternatively figure out what Webrecorder is good at and make sure you're good at something different. It's a crazy hard problem to do well and it's great to have more ideas in the mix.

6 comments

motdiem 3321 days ago

Seconding Webrecorder (and the newly updated WAIL) - I had the chance to meet Ilya Kreymer at a conference a few weeks ago, and I can confirm what he's doing is top notch - I'm hoping to see more work around WARC viewing and sharing in the future.

(Disclaimer: I also do personal archiving stuff with getkumbu)

johnaberlin 3320 days ago

Hi motdiem,

Thank you for seconding the newly updated WAIL. I am the maintainer/creator of the newly updated WAIL (the Electron version): https://github.com/N0taN3rd/wail

I was unable to attend the IIPC Web Archiving Conference (WAC), but the original creator of WAIL (Python), Mat Kelly, did attend (we are both a part of the same research group, WSDL).

If you or anyone else have any questions about WAIL I am more than happy to answer them.

amrrs 3321 days ago

Is offline playback still relevant in the age of ubiquitous always connected Internet?

kchr 3321 days ago

If your intention is to have a local archive of an online site, yes.

WhiteOwlLion 3321 days ago

In my use case, some content is only available for a short window. If I want to refer to it later, the URL will not work. This happens to me a lot on Wikipedia, where a referenced URL is no longer working (linkrot). We need a better way to track previous versions or access 404 pages that were previously alive.

an27 3321 days ago

Definitely, sites and content become inaccessible all the time.

For instance, I back up all new videos of my favorite YouTubers in case they are taken down (e.g. in the case of a copyright claim).

shasheene 3321 days ago

Last I played with it, the latency on webrecorder was uncomfortably high for always-on recording of personal web usage (the pages only display once fully rendered). I wish webpages would render as normal and get asynchronously archived once loading is complete.

That would allow constant archival of every webpage a user ever visits -- an immutable record over the user's years of crawling the web.

unicornporn 3321 days ago

> That would allow constant archival of every webpage a user ever visits -- an immutable record over the user's years of crawling the web.

This is usually solved by using a proxy: http://netpreserve.org/projects/live-archiving-http-proxy/

owenversteeg 3321 days ago

Can this be combined with webrecorder? Does anyone know someone who's done this? I only use about 30GB of traffic a month so a 2TB $70 hard drive would last me almost six years.

ikreymer 3321 days ago

Thanks Jack for mentioning Webrecorder! This is a project I started and it is now part of rhizome.org, a non-profit dedicated to promoting internet-based art and digital culture.

I thought I’d add a few notes here, as there are a few ways you can use Webrecorder and related tools.

First, Webrecorder supports two distinct modes:

- Native recording mode — http/s traffic goes through the browser and is rewritten to point to the Webrecorder server (this is the default).

- Remote browser mode — Webrecorder launches a browser in a Docker container and streams the screen to your browser (using noVNC). The traffic is either recorded or replayed depending on the mode, but the operation is the same (we call this ‘symmetrical archiving’). This gives you a recording proxy w/o having to configure your browser or install any plugins.

You can choose this mode by clicking the dropdown to choose a browser (currently Chrome and FF). This is essentially a remote browser configured via HTTP/S proxy, and allows us to record things like Flash and even Java applets, and other technologies that may become obsolete.

- We also have a desktop player app, Webrecorder Player, available for download from: https://github.com/webrecorder/webrecorderplayer-electron/re...

This is an app that plays back WARC files (created by Webrecorder and elsewhere), and allows browsing any WARC file offline.

Another way to create a web archive (for developers): You can use the devtools in the browser to export HAR files, and Webrecorder and Webrecorder Player will convert them on the fly and play them back. Unfortunately, this option is sort of limited to developers, but you can actually create a fairly good archive locally using HAR export (available in Chrome and Firefox at least). The conversion is done using this tool: https://github.com/webrecorder/har2warc
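
To give a rough idea of what a HAR export contains before conversion, here is a minimal sketch using only the Python standard library. It does not use har2warc itself; the `summarize_har` helper and the sample data are made up for illustration, and only the `log` / `entries` / `request` / `response` layout comes from the HAR format.

```python
import json


def summarize_har(har_text):
    """Return (method, url, status, mime_type) for each entry in a HAR document."""
    har = json.loads(har_text)
    rows = []
    for entry in har["log"]["entries"]:
        request = entry["request"]
        response = entry["response"]
        # "content" holds the response body metadata; be lenient if it is missing.
        mime = response.get("content", {}).get("mimeType", "")
        rows.append((request["method"], request["url"], response["status"], mime))
    return rows


# Hypothetical two-request capture, shaped like a devtools HAR export.
SAMPLE_HAR = json.dumps({
    "log": {
        "version": "1.2",
        "entries": [
            {
                "request": {"method": "GET", "url": "http://example.com/"},
                "response": {"status": 200, "content": {"mimeType": "text/html"}},
            },
            {
                "request": {"method": "GET", "url": "http://example.com/logo.png"},
                "response": {"status": 200, "content": {"mimeType": "image/png"}},
            },
        ],
    }
})

for method, url, status, mime in summarize_har(SAMPLE_HAR):
    print(method, url, status, mime)
```

Each entry pairs one request with its response, which is roughly the information a converter needs to write out request and response records.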

- If you use webrecorder.io, you can register for an account or use it anonymously. If you register for an account, we provide 5GB storage and you have a permanent url for your archive. You can also upload existing WARCs (or HARs).

- You can also run Webrecorder on your own! The main codebase is at: https://github.com/webrecorder/webrecorder and the remote browser system is actually a separate component and was first used for oldweb.today and lives at https://github.com/oldweb-today

Finally, the core replay/recording tech is actually a separate component, an advanced ‘wayback machine’ being developed in https://github.com/ikreymer/pywb

There are a lot of different components here, and we would definitely appreciate help with any and all parts of the stack if anyone is interested! All our work is open-source and we are a non-profit, so any help is appreciated.
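
For anyone wondering what is inside the WARC files mentioned above, here is a simplified sketch of the record layout: a `WARC/1.0` version line, named header fields, a blank line, then a content block of `Content-Length` bytes. It is not how Webrecorder or pywb read archives. The `make_record` and `read_warc_records` helpers are hypothetical, the sample leaves out fields a real record carries (such as `WARC-Record-ID` and `WARC-Date`), and it assumes an uncompressed stream, whereas real archives are usually gzipped.

```python
import io


def make_record(record_type, uri, body):
    """Build one simplified, uncompressed WARC-style record as bytes."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {record_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    # Each record is followed by two CRLFs.
    return head + body + b"\r\n\r\n"


def read_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # end of stream
        if not version.strip():
            continue  # blank separator lines between records
        if not version.startswith(b"WARC/"):
            raise ValueError("not a WARC record: %r" % version)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


sample = (
    make_record("request", "http://example.com/", b"GET / HTTP/1.1\r\n\r\n")
    + make_record("response", "http://example.com/", b"HTTP/1.1 200 OK\r\n\r\nhello")
)

for headers, payload in read_warc_records(io.BytesIO(sample)):
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(payload))
```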

owenversteeg 3321 days ago

Wow, webrecorder seems very cool, especially since it's OSS. Is there any way to set it up to record all incoming traffic? In these days of cheap storage that'd be very cool. I know I personally only use about 30GB a month, so a $70 2TB hard drive would last me five and a half years of browsing.
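
The back-of-the-envelope estimate can be checked with a few lines of Python. This is only a sketch: it assumes 2 TB means 2000 GB and that every byte of traffic is stored as-is, with no compression or deduplication.

```python
DRIVE_GB = 2000        # 2 TB in decimal gigabytes (assumption)
MONTHLY_GB = 30        # traffic per month, from the comment
DRIVE_COST_USD = 70    # drive price, from the comment

months = DRIVE_GB / MONTHLY_GB
years = months / 12
cost_per_year = DRIVE_COST_USD / years

print(f"{months:.1f} months, about {years:.2f} years")
print(f"roughly ${cost_per_year:.2f} per year of browsing")
```

That works out to a little over five and a half years per drive.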

JackC 3321 days ago

If you're a coder I bet you could hack it to do that. It has an amazing containerized browser mode where you can browse in a remote browser via VNC, with the remote browser set up to use a WARC-writing proxy. So the general outline would be to run it locally in Docker; expose the proxy port used by the containerized browsers; and configure your own browsers to use the same proxy.
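
The last step of that outline, pointing a client at the recording proxy, might look like this from Python. It is a sketch only: the `localhost:8080` address is a placeholder rather than a port Webrecorder is known to expose, and `build_recording_opener` is a made-up helper. Recording HTTPS this way would presumably also require trusting the proxy's certificate.

```python
import urllib.request

# Hypothetical address: wherever the recording proxy port ends up exposed.
PROXY = "http://localhost:8080"


def build_recording_opener(proxy_url):
    """Return a urllib opener that sends http and https requests via proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


opener = build_recording_opener(PROXY)

# With a proxy actually listening on PROXY, this request would pass through it:
# opener.open("http://example.com/")
```

A desktop browser would be configured the same way through its own proxy settings.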

I'm not sure how much this would interfere with normal browsing -- it's not a typical use case.

psteinweber 3321 days ago

This is great and helps me a ton, thanks for mentioning it here.

agamble 3321 days ago

Thanks Jack, I hadn't heard of webrecorder before, but I'll check it out. :)
