| HN Mirror


by alexwlchan 788 days ago

1/ Why not wget?

For this project I wanted a consistent file format for my entire collection.

I have a bunch of stuff I want to save which is behind paywalls/logins/clickthroughs that are tricky for wget to reach. I know I can hand wget a cookies file, but that’s mildly fiddly. I save those pages as Safari webarchive files, and then they can drop in alongside the files I’ve collected programmatically. Then I can deal with all my saved pages as a homogeneous set, rather than being split into two formats.
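For anyone curious what the cookies-file route looks like outside wget, here is a minimal Python sketch. It assumes the cookies were exported in the Netscape `cookies.txt` format, which is the format `http.cookiejar.MozillaCookieJar` reads; the function name and the path argument are hypothetical, not something from my project.

```python
import http.cookiejar
import urllib.request


def opener_with_cookies(cookies_path):
    """Return (opener, jar) that send cookies loaded from a cookies.txt file."""
    jar = http.cookiejar.MozillaCookieJar(cookies_path)
    # Keep session cookies and expired cookies so the file is loaded as exported.
    jar.load(ignore_discard=True, ignore_expires=True)
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar
```

Calling `opener.open(url)` would then send any matching cookies with the request. Whether that gets past a given paywall or login still depends on the site.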

Plus I couldn't find anybody who'd done this, and it was fun :D

This is only for personal stuff where I know I'll be using Safari/macOS for the foreseeable future. I don't envisage using this for anything professional, or a shared archive -- you're right that a less proprietary format would be better in those contexts. I think I'm in a bit of a niche here.

(I'm honestly surprised this is on the front page; I didn't think anybody else would be that interested.)

2/ Proprietary format: it is, but before I started I did some experiments to see what's actually inside. It's a binary plist and I can recover all the underlying HTML/CSS/JS files with Python, so I'm not totally hosed if Safari goes away.

Notes on that here: https://alexwlchan.net/til/2024/whats-inside-safari-webarchi...
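A rough sketch of that recovery step, using only the standard library. The key names (`WebMainResource`, `WebSubresources`, `WebResourceURL`, `WebResourceMIMEType`, `WebResourceData`) are assumptions about the plist layout, and the function name is made up for illustration; check them against a real file before relying on this.

```python
import plistlib


def iter_webarchive_resources(archive_bytes):
    """Yield (url, mime_type, data) for each resource stored in a webarchive."""
    # plistlib.loads detects the binary plist format automatically.
    archive = plistlib.loads(archive_bytes)
    # Assumed layout: one main resource plus an optional list of subresources.
    resources = [archive["WebMainResource"]] + archive.get("WebSubresources", [])
    for resource in resources:
        yield (
            resource["WebResourceURL"],
            resource.get("WebResourceMIMEType", ""),
            resource["WebResourceData"],
        )
```

To use it on a saved page, read the file in binary mode and pass the bytes in; each `data` value is the raw bytes of one HTML/CSS/JS/image resource.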


pvg 788 days ago

> I didn't think anybody else would be that interested.

'Save the webpage as I see it in my browser' remains a surprisingly annoying and fiddly problem, especially programmatically, so the niche is probably a little roomier than you might initially suspect.

DaSHacka 788 days ago

> 'Save the webpage as I see it in my browser' remains a surprisingly annoying and fiddly problem

You may be interested in SingleFile[1]

[1] https://github.com/gildas-lormeau/SingleFile

I use it all the time to archive webpages, and I imagine it wouldn't be hard to throw together a script to use FireFox's headless mode in combination with SingleFile to selfhost a clone of the wayback machine.

freedomben 788 days ago

This is what I was going to say as well. Somebody on HN told me about SingleFile and I use it all the time now! Really amazing extension.

pvg 788 days ago

Thanks, I've seen it; last time I tried it, it missed background images. But my point is that this is something browsers should support better. They kind of sort of do now, but even with that it's a hassle.

tedmiston 788 days ago

I tested this just now on the blog post that this HN page points to and SingleFile handled the background image fine.

cxr 788 days ago

> FireFox's

It's just "Firefox".

sturakov 788 days ago

I've enjoyed using this

https://github.com/webrecorder

It has a standardized format and acts like a recorder for what you see.

factormeta 788 days ago

Thanks to all the JS/SPA developers who insist on putting JS all over the place. Wouldn't it be better to have everything in one .html, using <script> <style> just inline? Then it is also just one file over the internet. There must be a bundler that does that, no?

It seems JS developers just want their code to be as obfuscated and unarchivable as possible unless it is served via their web server.

cxr 788 days ago

> using <script> <style> just inline

These SPA bundles are on the order of megabytes, not kilobytes. You want your users, for their own sake and yours, to be able to cache as much as possible instead of delivering a unique megablob payload for every page they hit.

vmfunction 788 days ago

Good point on the cache. However, things such as putting a background image in CSS, so the user can't right-click to download the image, are just stupid. Why is CSS all of a sudden in control of the image display? It just makes archiving pages harder.
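To make the archiving cost concrete: a tool has to parse the stylesheet to find those images at all. Below is a minimal Python sketch that pulls `url(...)` references out of CSS text and resolves them against the stylesheet's URL. The function name is hypothetical, and the regex is a simplification that ignores escapes, comments and `@import` strings.

```python
import re
from urllib.parse import urljoin

# Matches url(...) with optional single or double quotes around the reference.
CSS_URL = re.compile(r"url\(\s*(['\"]?)(.*?)\1\s*\)", re.IGNORECASE)


def css_asset_urls(css_text, base_url):
    """Return absolute URLs referenced via url(...) in a stylesheet."""
    urls = []
    for match in CSS_URL.finditer(css_text):
        ref = match.group(2).strip()
        # Inline data: URIs are already embedded, so there is nothing to fetch.
        if ref and not ref.startswith("data:"):
            urls.append(urljoin(base_url, ref))
    return urls
```

Relative references are resolved against the stylesheet's own URL rather than the page's, which is one more detail an archiver has to get right.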

diggan 788 days ago

> 'Save the webpage as I see it in my browser' remains a surprisingly annoying and fiddly problem

Is it really? I remember hacking around with JavaScript's XMLSerializer (I think) about 5 years ago, and that solved it for ~90% of the websites I tried to archive. It'd save the DOM as-is when executed.

Internet Archive/ArchiveTeam also worked on that particular problem for a very long time, and are mostly successful as far as I can tell.

pvg 788 days ago

90% feels like an overestimate to me, but even that is quite poor; you wouldn't accept that for saving most other things. Another problem is highlighted in the piece: it's a hassle to ensure external tools handle session state and credentials. Dynamic content is poorly handled, and the default behaviours are miserable (a browser will run random JavaScript from the network but not JavaScript you've saved, etc).

There's a lot of interest in 'digital preservation', and perhaps one sign that it's still very much early days for the field is that it's tricky to 'just save' the results of one of the most basic current computer interactions: looking at a web page.

diggan 788 days ago

But if you serialize the DOM as-is, you literally get what you see on the page when you archive it. Nothing about it is dynamic, and there are no sessions or credentials to handle. Granted, it's a static copy of a specific single page.

If you need more than that, then WARC is probably the best. For my measly needs of just preserving exactly what I see, serializing the DOM and saving the result seems to do just fine.

pvg 788 days ago

Yes, you save something that's mildly better than print-page-to-PDF. But it still misses things, and the interactive stuff is very much part of 'exactly what I see'. Take a random article with an interactive graph, for instance, like this recent HN hit: https://ciechanow.ski/airfoil/

It's not that there aren't workarounds, it's that they are clunky and 'you can't actually save the most common computery entity you deal with' is just a strange state of affairs we've somehow Stockholmed ourselves to.

tedmiston 788 days ago

> Internet Archive/ArchiveTeam also worked on that particular problem for a very long time, and are mostly successful as far as I can tell.

One category that the archivers do poorly with is news articles where a pop-up renders on page load which then requires client-side JS execution to dismiss the pop-up.

Sometimes it is easily circumvented by manual DOM manipulation, but that's hardly a bulletproof solution. And it feels automatable.

brnt 787 days ago

Print to PDF seems to be the only way to ensure you record what you saw.