by 1vuio0pswjnm7 336 days ago
https://web.archive.org

https://commoncrawl.org

I would prefer more of these.

Alas, archive.today (archive.ph, archive.is, archive.vn, etc.) is blocked in some countries, sometimes serves CAPTCHAs, tries to create a "fingerprint" using JavaScript, and contains a tracking pixel.

Neither Internet Archive nor Common Crawl does those things. (There are other archives I am not mentioning that do not do these things either.)

When it works, archive.today may seem like a perfect solution to "paywalls". And then it stops working. In truth, most paywalls are solved by controlling HTTP headers like User-Agent (UA) and X-Forwarded-For, controlling JavaScript, and controlling cookies. This control requires no third-party intermediary (middleman) like archive.today. Or Internet Archive, for that matter.
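
As a rough illustration of what "controlling HTTP headers" means on the client side, here is a minimal Python sketch using only the standard library. The URL, the User-Agent string, and the IP address are made-up placeholders, and `build_request` is a hypothetical helper, not anything from a real tool. Whether a given site responds differently to these headers is an assumption this sketch does not test.

```python
import urllib.request


def build_request(url, user_agent, forwarded_for=None):
    """Build (but do not send) an HTTP request with explicit headers.

    Hypothetical helper for illustration only. No Cookie header is set,
    and urllib's default opener does not store cookies or run JavaScript.
    """
    headers = {"User-Agent": user_agent}
    if forwarded_for is not None:
        # Placeholder value; the client chooses what goes in this header.
        headers["X-Forwarded-For"] = forwarded_for
    return urllib.request.Request(url, headers=headers)


# Placeholder URL, UA string, and documentation-range IP address.
req = build_request(
    "https://example.com/article",
    user_agent="ExampleClient/1.0",
    forwarded_for="203.0.113.7",
)

# Inspect the headers that would be sent; nothing touches the network here.
for name, value in req.header_items():
    print(f"{name}: {value}")
```

Passing `req` to `urllib.request.urlopen` would actually send it; the point is only that the client, not a middleman, decides which headers and cookies go out.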

None of these archives are perfect and it's true the public could use more of them. But there are better ways to avoid "paywalls", which are just a means of collecting data about non-subscribers while deliberately annoying them with JavaScript.

2 comments

The Internet Archive is significantly less useful because they allow people to exclude their public social media accounts or websites. On a couple of occasions I have tried to find a source for old deleted statements using the IA, only to find that the data had been scrubbed. Fortunately archive.today still had a copy in one case, but in the other one I was out of luck.
What were you looking for that was prone to scrubbing? Just curious, because I have a collection of historical data to go through and don't know what to expect.
In one case it was a personal website, the other was a Twitter account. Both got scrubbed from the IA.

Apparently they will comply with GDPR and DMCA requests; I'm not sure what precise mechanism was used in those cases.

https://www.reddit.com/r/privacy/comments/eut3na/can_i_get_p...

https://www.joshualowcock.com/guide/how-to-delete-your-site-...

The Internet Archive operates within the law (mostly), while archive.foo is blatantly illegal, which is why it has so many domain names, among other things. Think Anna's Archive vs Library of Congress.

The future is going to be some kind of bland corporate internet of useless corporate things (only people with a team of lawyers can afford to operate any service on this dark-forest light-network), paired with some kind of dark web full of very useful non-corporate things, which everyone will use every day and which corporations are constantly trying to hunt down.

> The Internet Archive operates within the law

It most certainly does not. The archive is home to petabytes of pirated content, and Jason Scott himself has told people many times on many different platforms/interviews to intentionally upload copyrighted content, because "if we had to police everything, we would have no content... so upload first, then let the rightsholder deal with requesting takedowns".

All you have to do is click on the "software" link at the top of the page, and you can find just about any copyrighted app or game that has ever been released, on any platform, available to download instantly for free. Besides usenet, it's the largest centralized cache of pirated content on the planet.

It's one thing to claim safe harbor (Section 230, or for copyright claims the DMCA's Section 512) because you are a service provider and you don't control what your users upload, but it's entirely another thing to publicly acknowledge that you're aware that people do it, you encourage them to do it, AND you don't care.

And regarding archive.foo, just because they have many domains doesn't make it illegal... it means they have enemies who are guilty of the Streisand Effect. Enemies who are known to attack registrars, DNS providers, upstream ISPs/hosting providers and anyone else who will entertain a false flag attempt at claiming a ToS violation in order to get a site taken offline.

The Internet Archive does turn a blind eye when it comes to proactively moderating uploads, but they're not required to do that. They do follow takedown requests as the law prescribes (including taking down lots of stuff that is legal and really shouldn't be taken down, because the takedown laws have no exceptions for it).

The Internet Archive tries to push boundaries sometimes; all corporations do. IA having a link to "software" and then not proactively moderating that section is like Uber not getting medallions for its taxi driver employees, whom it calls contractors. It's not the same as the flagrant disobedience from archive.today. IA did flagrantly disobey one time, and it almost catastrophically deleted them from existence, to the detriment of everyone.