1/ Why not wget? For this project I wanted a consistent file format for my entire collection. I have a bunch of stuff I want to save which is behind paywalls/logins/clickthroughs that are tricky for wget to reach. I know I can hand wget a cookies file, but that’s mildly fiddly. I save those pages as Safari webarchive files, and then they can drop in alongside the files I’ve collected programmatically. Then I can deal with all my saved pages as a homogeneous set, rather than having them split across two formats. Plus I couldn't find anybody who'd done this, and it was fun :D

This is only for personal stuff where I know I'll be using Safari/macOS for the foreseeable future. I don't envisage using this for anything professional, or a shared archive -- you're right that a less proprietary format would be better in those contexts. I think I'm in a bit of a niche here. (I'm honestly surprised this is on the front page; I didn't think anybody else would be that interested.)

2/ Proprietary format: it is, but before I started I did some experiments to see what's actually inside. It's a binary plist and I can recover all the underlying HTML/CSS/JS files with Python, so I'm not totally hosed if Safari goes away. Notes on that here: https://alexwlchan.net/til/2024/whats-inside-safari-webarchi...
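For anyone curious what "recover the files with Python" might look like, here is a minimal sketch using only the standard library's `plistlib`. It is not the code from the linked notes. The function name `list_webarchive_resources` is made up for illustration, and the sketch assumes the usual webarchive layout: a top-level `WebMainResource` dict plus an optional `WebSubresources` list, each entry carrying `WebResourceURL`, `WebResourceMIMEType` and `WebResourceData`. It ignores nested frames (`WebSubframeArchives`).

```python
import plistlib


def list_webarchive_resources(raw: bytes):
    """Return (url, mime_type, data) tuples for the resources in a webarchive.

    `raw` is the full contents of a .webarchive file. Hypothetical helper:
    it only reads the main resource and the flat list of subresources.
    """
    # plistlib.loads detects binary vs. XML plists on its own.
    archive = plistlib.loads(raw)

    # The main HTML page, followed by any CSS/JS/images saved with it.
    resources = [archive["WebMainResource"]]
    resources += archive.get("WebSubresources", [])

    return [
        (r["WebResourceURL"], r.get("WebResourceMIMEType"), r["WebResourceData"])
        for r in resources
    ]
```

Passing it the bytes of a saved file (opened in `"rb"` mode) should give back one tuple per resource, with the page itself first; writing each `data` value out to disk is then an ordinary file write.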
'Save the webpage as I see it in my browser' remains a surprisingly annoying and fiddly problem, especially programmatically, so the niche is probably a little roomier than you might initially suspect.