| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by ikreymer 3282 days ago

Unfortunately, this approach alone will only work for sites that are mostly static, eg. do not use JS to load dynamic content. That is a small (and shrinking) percent of the web. Once JS is involved, all bets are off -- JS will attempt to load content via ajax, or generate new html, load iframes, etc and you will have 'live leaks' where the content seems to be coming form the archive but is actually coming form the live web.

Here is an example from archiving nytimes home page:

https://archive.tesoro.io/665dbeab57a4d57d8140f89cfedc69b5

If you look at network traffic (domain in devtools), you'll see that only a small % is coming from archive.tesoro.io -- the rest of the content is loaded from the live web. This can be misleading and possibly a security risk as well.

Not to discourage you, but this is a hard problem and I've been working on for years now. This area is a moving target, but we think live leaks are mostly eliminated in Webrecorder and pywb, although there are lots of areas to work on to maintain high-fidelity preservation.

If you want chat about possible solutions or want to collaborate (we're always looking for contributors!), feel free to reach out to us at support [at] webrecorder.io or find my contact on GH.