by JackC 495 days ago
Hi! Perma is made by the Harvard Library Innovation Lab, which I direct, and I wrote a bunch of the early code for it back in 2015 or so.

For HN readers, I'd suggest checking out https://tools.perma.cc/, where we post a bunch of the open source work that backs this. Thanks to the shift from WARC to WACZ (a zipped web archive format developed by WebRecorder), it's now possible to pass around fully interactive, high-fidelity web archives as simple files and host them with client-side JavaScript, which opens up a bunch of new possibilities for web archive designs. You can see some tech demos of that at https://warcembed-demo.lil.tools/, where each page is just a static file on the server plus some client-side JavaScript.
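To make the "simple files" point concrete, here is a minimal sketch of opening a WACZ as an ordinary zip. It assumes the manifest sits at the zip root as `datapackage.json`; the helper names and the toy archive contents are made up for illustration and are not a real capture.

```python
import io
import json
import zipfile


def summarize_wacz(source):
    """Return (manifest, member names) for a WACZ given a path or binary file object."""
    with zipfile.ZipFile(source) as archive:
        names = archive.namelist()
        # Assumption: the package manifest lives at the zip root as datapackage.json.
        manifest = json.loads(archive.read("datapackage.json"))
    return manifest, names


def build_toy_wacz():
    """Build a tiny in-memory stand-in for a WACZ (placeholder contents only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "datapackage.json",
            json.dumps({"profile": "data-package", "resources": []}),
        )
        archive.writestr("archive/data.warc.gz", b"placeholder bytes")
        archive.writestr("pages/pages.jsonl", '{"format": "json-pages-1.0"}\n')
    buffer.seek(0)
    return buffer


manifest, names = summarize_wacz(build_toy_wacz())
print(names)
```

Because it is just a zip, any static file host can serve it, and a client-side replayer can fetch the members it needs.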

It's best to think of Perma.cc itself, the service, as some UX and user support wrapping to help solve linkrot primarily for law journals, courts, and journalists (for example, dashboards for a law journal to collaborate on the links they're archiving for their authors), and of our work on this as building from that use case to try to make it easier for everyone to build similar things.

I saw some mentions of the Internet Archive, which is great, and is also kind enough to keep a copy of our archives and expose them through the Wayback Machine. One thing I've been thinking about recently in archiving is that there's a risk to overstandardizing -- you don't want things too much captured with the same software platforms, funded through the same models, governed by the same people, exposed through the same interfaces, etc. There's supposed to be thousands of libraries, not one library. Unlike "don't roll your own crypto," I'd honestly love to see more people roll their own archives.

Happy to answer any questions!

5 comments

My first question was "If this is a free service, how do I know it will still be around in even a few years?". This was answered by your comment that it is (or at least appears to be?) funded by Harvard.

In which case, why isn't this prominently displayed on the main page? Or why not use a Harvard library URL, which would significantly boost the trust level? Especially vs. a ccTLD, since those are known to be problematic?

It is on core Harvard funds, and we also have paid accounts used by law firms and journalists.

As an innovation lab we often minimize Harvard branding with project websites because it's more instructive to win or lose on our own merits than based on how people feel about Harvard, in either direction.

Yeah, but the mere success of a service like perma.cc relies on trust. How does someone trust that you will be here in 10, 20, etc. years?

Harvard has been around for hundreds of years, Harvard has inbuilt trust, Harvard has funding. You should negotiate and arrange going behind its brand.

Things like https://perma.cc/sign-up/courts

It's in several US states' interest to make sure this service keeps existing.

wow hats off man!

But I am also wondering: is this sustainable? Can I use this to archive Hacker News itself, for example?

And how can we verify the integrity of the web archive? Could you please explain?

Thanks a lot in advance. I wish you all the best in your career & this project!

I guess it’s not sufficiently prominent (given that you didn’t see it), but this is discussed in detail in the FAQ section.
I think the main question is:

- Why is it better than internet archive?

I personally see the benefit as potentially keeping the Internet Archive from being the only game in town, but even that comes with certain costs (which may not be great for the community as a whole, depending on who you ask).

I would love to hear your perspective on where you stand as related to other providers of similar services.

I think the biggest distinction is between archiving platforms made primarily for authors and primarily for web crawlers.

If you're an author (say, of a court decision) and you archive example.com/foo, Perma makes a fresh copy of example.com/foo as its own wacz file, with a CPU-intensive headless browser, gives it a unique short URL, and puts it in a folder tree for you. So you get a higher quality capture than most crawls can afford, including a screenshot and pdf; you get a URL that's easy to cite in print; you can find your copy later; you get "temporal integrity" (it's not possible for replays to pull in assets from other crawls, which can result in frankenstein playbacks); and you can independently respond to things like DMCA takedowns. It's all tuned to offer a great experience for that author.
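A toy model of the "temporal integrity" point, for anyone who wants to see the failure mode. This is a sketch under made-up assumptions (the `Record` type, the capture IDs, and both replay functions are hypothetical), not how Perma or the Wayback Machine actually implement replay.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    url: str
    capture_id: str  # which capture session stored this record
    timestamp: int   # capture time, arbitrary units
    body: str


def replay_pooled(records, url, wanted_time):
    """Pooled replay: closest-in-time record for a URL from ANY capture."""
    candidates = [r for r in records if r.url == url]
    return min(candidates, key=lambda r: abs(r.timestamp - wanted_time), default=None)


def replay_isolated(records, url, capture_id):
    """Isolated replay: only records from the one capture are eligible."""
    for r in records:
        if r.url == url and r.capture_id == capture_id:
            return r
    return None


records = [
    Record("example.com/foo", "A", 100, "page v1"),
    Record("example.com/foo", "B", 500, "page v2"),
    # Capture A never stored the stylesheet; only capture B has one.
    Record("example.com/style.css", "B", 500, "style v2"),
]

pooled = replay_pooled(records, "example.com/style.css", wanted_time=100)
isolated = replay_isolated(records, "example.com/style.css", capture_id="A")
print(pooled.capture_id, isolated)
```

The pooled lookup quietly serves the page from capture A with a stylesheet from capture B, which is the frankenstein playback. The isolated lookup returns nothing, so the gap stays visible instead of being filled from another point in time.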

IA is primarily tuned for preserving everything regardless of whether the author cared to preserve it or not, through massive web crawls. Which is often the better strategy -- most authors don't care as much as judges about the longterm integrity of their citations.

This is what I'm getting at about the specific benefits of having multiple archives. It's not just redundancy, it's that you can do better for different users that way.

> - Why is it better than internet archive?

With the Internet Archive, the purpose seems to be public archiving. One could imagine a use case where you want non-public archives, which are therefore not subject to any take-down requests, especially if they are considered court evidence, for example.

By paying directly for your links to be archived, it directly helps fund the service and therefore keep it going. You would want to see some guarantees in the contract about pricing if you were to long-term rely on the service.

Irrelevant. The point is that there shouldn't be a single archive for anything, because then it has the longevity of the operators. Who can say whether Harvard or the IA will close its service first? Why choose?
Is there any concept of signing data at time of archive, and verification at time of access, to prove it is not later tampered with, say by a bribed sysadmin?

Similarly are there any general supply chain integrity measures in place, such as code review of dependencies, reproducible builds, or creating archives reproducibly in independently administrated enclaves?

You note archives could be used for instances like Supreme Court decisions, so anyone with power to tamper with content would certainly be targeted.

We're coauthors on the wacz-auth spec, which is designed to solve this sort of thing by signing archives with the domain cert of the archive that created them. If you cross-sign with a private cert you can do pretty well with this approach against various threat models, though it has to be part of a whole PKI security design.

I think the best approach for high stakes archiving is to have a standard for "witness APIs" so that you could fetch archives from independent archiving institutions. That also solves for the web looking different from different places. That hasn't gelled yet, though.

WACZ files created by WebRecorder software like archiveweb.page are signed (by you) and timestamped (by a third party using RFC 3161).
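The digest-checking part of this can be sketched in a few lines. This assumes the manifest (`datapackage.json`) lists each resource with a `path` and a `hash` of the form `sha256:<hex>`; the helper names and toy package are hypothetical. It only compares content digests and does not validate any signature or RFC 3161 timestamp.

```python
import hashlib
import io
import json
import zipfile


def verify_resources(source):
    """Return {path: bool}: does each listed resource match its recorded digest?"""
    results = {}
    with zipfile.ZipFile(source) as archive:
        manifest = json.loads(archive.read("datapackage.json"))
        for resource in manifest.get("resources", []):
            # Assumption: hashes are recorded as "<algorithm>:<hex digest>".
            algorithm, _, expected = resource["hash"].partition(":")
            data = archive.read(resource["path"])
            actual = hashlib.new(algorithm, data).hexdigest()
            results[resource["path"]] = actual == expected
    return results


def build_toy_package(payload, recorded_payload):
    """Toy package that stores `payload` but records the digest of `recorded_payload`."""
    digest = hashlib.sha256(recorded_payload).hexdigest()
    manifest = {
        "resources": [{"path": "archive/data.warc.gz", "hash": f"sha256:{digest}"}]
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("datapackage.json", json.dumps(manifest))
        archive.writestr("archive/data.warc.gz", payload)
    buffer.seek(0)
    return buffer


intact = verify_resources(build_toy_package(b"original", b"original"))
tampered = verify_resources(build_toy_package(b"altered", b"original"))
print(intact, tampered)
```

On its own this only catches accidental corruption or a clumsy edit: anyone who can rewrite the payload can rewrite the manifest too. That is why the manifest digest itself needs to be signed and timestamped by a party the tamperer does not control.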
And put the signatures on a blockchain so that the perma.cc holders, or the USA government, can't easily alter things either.
Since you own the "perma.link" domain name (I just looked it up) why don't you use that instead of .cc which has issues?
It's really annoying that domain is not the main one, it's so much better!
What happens if you get a lawsuit or injunction demanding information removal or alteration? What if somebody archives a born secret or something sensitive?