by badlibrarian 501 days ago
Many of these sites are already captured and archived by proper entities as required by federal law. More is better, I guess, except when it isn't. Duplication of effort is a huge problem in the humanities in general and with archiving in particular.

The whole concept needs to be rethought. Captures from these tools show up under "ArchiveTeam" which is currently pumping thousands of copies of the Google Home Page into the Wayback Machine every week. Or at least trying to.

https://web.archive.org/web/20250122000033/www.google.com

Like so many things about archive.org, when you dig in you start to find wonder and craziness at every turn.

> by proper entities as required by federal law.

What federal law do you suppose is guiding the mass deletions? That doesn't look like archiving to me. Now that the foxes are running the henhouse, how reliable do you suppose their own archives are?

Some of the mass deletions are merely a new administration setting up shop. Policies from the previous administration don't belong on the current whitehouse.gov. They wind up here instead https://bidenwhitehouse.archives.gov/

We pay half a billion in tax dollars for the National Archives, and nearly a billion to the Library of Congress to preserve these records. Others are managed as part of Presidential Libraries.

Thousands of employees, dozens of facilities, billions of dollars.

Meanwhile archive.org doesn't have air conditioning and preserves physical material within the blast radius of an oil refinery. They let vagrants sleep on their steps yet seem surprised when they set the utility pole outside on fire.

I didn't say it didn't need to be done. I said the whole process needs to be rethought with professional supervision. Setting up more volunteer K8s clusters so that more copies of the Google Home Page can be captured with the wrong user agent isn't going to save democracy.

Archive.org is outside of the reach of the US government, and is globally distributed. When the US government deletes data or takes it dark (as it has recently done across wide swaths of federal government websites), you have no recourse. This means your argument about the resources that go into the US government as a data custodian is meaningless: the outcome is what is material, which is the archival and long term custody & availability of the data sets in scope. Arguably, the Internet Archive has recently proven better at this job than the US government (unsurprising).

You're angry at a high value non profit operating on a limited budget. It's weird. I recommend focusing on more important issues than "it is icky around the richmond facility, the power goes out once in a while, and they use ambient air and convection for system cooling which I don't like."

If you want to save democracy, the Internet Archive doesn't do that itself. It protects the historical record. If you want to save democracy, that's a different conversation.

https://blog.archive.org/2024/05/08/end-of-term-web-archive/

https://web.archive.org/collection-search/EndOfTerm2024PreEl...

(no affiliation)

I would classify the end of term web archive (which archive.org is, in its typical fashion, taking far too much credit for) as an example of entities doing things right.

https://eotarchive.org/partners/

And saying "archive.org is outside the reach of the US government" -- hell, it's not even outside the reach of the RIAA or the book company with the little penguin on the cover.

We should have proper supervised federal archiving and archive.org should be far better run, too.

And I don't know what Archive Team is but maybe they could update their site to provide some information on the people involved. And perhaps update their understanding of what's possible with docker containers while they're at it.

Because the counterpoint to a radicalized Musk screwing around with government databases isn't an opposing group of anonymous radicals screwing around with commercial databases.

I'm interested in why you are saying that the Internet Archive is taking too much credit for the end of term web archive. The website you link to demonstrates that it's run by the Internet Archive, although various partners have joined it since it began.

Is that not correct?

> And I don't know what Archive Team is but maybe they could update their site to provide some information on the people involved.

You don't need to reveal your identity, but looking through your comments, it looks like you originally spun up this account to criticize the Internet Archive. I'll just note that accusing others of being "anonymous radicals" falls a little flatter when you're anonymous yourself.

(Relevant disclosure: I've worked with IA and Brewster Kahle, and defended him here before.)

> Is that not correct?

It's not run by the Archive. It's a collaboration. They didn't even do all the crawling, and the Library of Congress keeps a copy.

https://eotarchive.org/about/

As for Archive Team, their site declares "Archive Team is a loose collective of rogue archivists, programmers, writers and loudmouths."

Dedication is great. And radicalization in response to copyright and preservation certainly deserves some leeway. But a little professionalism wouldn't hurt and the 2600-era roleplay isn't fooling anyone.

Agree the US government should contribute in some capacity. Agree they should be robustly funded to do this. But checks and balances are also important, and when a node goes rogue or dark, the system must be fault tolerant and operate when degradation occurs. I previously said "I trust Brewster and the rest of the IA gang more than the US government to safeguard the Internet Archive." [1] I feel this assertion has been proven out over the last few weeks.

ArchiveTeam stands on its own as an independent, community driven volunteer digital archival and preservation effort. If you don't understand why, what, and how they operate, look closer and be more curious [2].

[1] https://news.ycombinator.com/item?id=41984664

[2] https://en.wikipedia.org/wiki/Wikipedia:Chesterton%27s_fence

If the checks and balances of NARA and LOC (6,000 employees, $1.5 billion in annual funding) is Brewster Kahle asking for $10 on pages serving pirated Nintendo games, then we're in a bit of trouble, aren't we?

> They let vagrants sleep on their steps yet seem surprised when they set the utility pole outside on fire.

Tbf I have let many people sleep on my doorstep and none of them tried to set my building on fire. One of them even sang for me; he had a killer baritone. Overall it seems like a fairly harmless thing.

I wasn't speaking metaphorically. Fire set to pole. Site went down.

You really do live up to your name!

You imply that archive.org is somehow doing something wrong by letting "vagrants" sleep on their steps. I'd assert that people who are compassionate are more trustworthy than people who think punishing others should be normalized. I'd definitely prefer my backups in the hands of compassionate people.

The problem is that the people who want to see others be punished can't be trusted to, you know, not do that. Removing information about climate change, about vaccines, about trans care, et cetera, very well could happen at the hands of those who get off on punishing others.

You say the National Archives already does this. What happens when the current administration fires everyone and replaces them with non-professionals?

So I really don't know why you'd be in here talking ish about ArchiveTeam.

> I'd definitely prefer my backups in the hands of compassionate people.

I prefer them in the hands of competent people, in a building with climate control.

Heard about the time these compassionate folks tried to run a bank and got shut down in the Obama era?

> Unwillingness to open accounts within the field of membership, make loans, and establish operations in the low-income community where the credit union was chartered to serve

https://ncua.gov/newsroom/press-release/2016/internet-archiv...

How do I as a non-US citizen get access to information from those "proper entities"? Is it even possible for US citizens? This is often a surprise for some visitors of this fine website, but there's a large world outside the US where "federal law" does not apply.

We fund the Library of Congress (largest library in the world) and the National Archives (NARA), which make all of this stuff public. Other governments do similar things. It's all on the web.

https://www.archives.gov/presidential-records/research/archi...

There are other agencies and data sources to be monitored of course but I'm not seeing a lot of nuance in those efforts yet.