by masklinn 2497 days ago
I don't know that it would be a very useful thing to do, at least in the short term. There's a bunch of "web archive" formats out there, and the common thread between them is that they're custom archive formats: you need special clients or dedicated support to open them:

* MHTML encodes the page as a multipart MIME message (using multipart/related), essentially an email (you're usually able to open one by renaming the .mht to .eml)

* WARC is its own thing with its own spec

* MAFF is a zipfile, not sure about the specifics

* webarchive is a binary plist, not sure about the specifics either
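
To illustrate the "essentially an email" point about MHTML: a rough sketch, assuming a well-formed multipart/related file, that lists the embedded resources using nothing but Python's standard `email` package. The function name `list_mhtml_parts` is made up for this example, and real-world files may need more careful handling.

```python
import email


def list_mhtml_parts(raw: bytes):
    """Return (content type, Content-Location) for each leaf part of an MHTML file.

    Hypothetical helper: it treats the file as an ordinary MIME message,
    which is what the multipart/related encoding amounts to.
    """
    msg = email.message_from_bytes(raw)
    parts = []
    for part in msg.walk():
        # Skip the multipart containers themselves, keep only the leaves
        # (the HTML document, images, stylesheets, ...).
        if part.is_multipart():
            continue
        parts.append((part.get_content_type(), part.get("Content-Location")))
    return parts
```

Usage would be something like `list_mhtml_parts(open("page.mht", "rb").read())`, where `page.mht` is a placeholder filename.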

Your tool generates straight HTML which any browser should be able to open. It probably has more limitations, but it doesn't require dedicated client / viewer support.

Maybe once you've got all the fetching and extracting and linking nailed down it would be a nice extension to add "output filters", but that seems more like a secondary long-term goal, especially as those archive formats are usually semi-proprietary and get dropped as fast as they get created (WARC might be the most long-lived as it descends from the Internet Archive's ARC, is an ISO standard and is recognised as a proper archival format by various national libraries).

1 comment

There isn't much to MAFF. Each MAFF file can contain more than one saved page. Each page needs to be contained within its own folder (whose name is usually the timestamp of when the page was saved, but it doesn't matter AFAICT). There can be an `index.rdf` file in there, to specify metadata and which file to open, but otherwise you should look for an `index.SOMETHING` file, usually `index.html`.

E.g.

  test.maff
  `--  1566561512/
       |--  index.rdf
       |--  index.html
       `--  index_files/
            `--  ???

When I was messing around with archiving things locally I settled on MAFF, because it's pretty much trivial to create and to use. Even if your browser does not support it, you just need to unpack it to a tempdir and open the index file.
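
The "unpack it and open the index file" fallback can be sketched in a few lines of Python. This is a minimal sketch under the layout described above (one folder per saved page, each with an `index.SOMETHING` file). The function name `find_maff_index_files` is made up, and it ignores `index.rdf` entirely rather than reading which file it points to.

```python
import zipfile
from pathlib import Path


def find_maff_index_files(maff_path, dest_dir):
    """Unpack a MAFF (zip) archive into dest_dir and return one index file per page.

    Hypothetical helper: assumes each saved page lives in its own top-level
    folder and picks the first index.* file that is not index.rdf.
    """
    dest = Path(dest_dir)
    with zipfile.ZipFile(maff_path) as zf:
        zf.extractall(dest)

    indexes = []
    for page_dir in sorted(p for p in dest.iterdir() if p.is_dir()):
        candidates = sorted(
            f for f in page_dir.glob("index.*")
            if f.is_file() and f.suffix != ".rdf"
        )
        if candidates:
            indexes.append(candidates[0])
    return indexes
```

To actually view a page you could pass the result to the standard `webbrowser` module, e.g. `webbrowser.open(indexes[0].as_uri())`, with `dest_dir` coming from `tempfile.mkdtemp()`.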