by starsinspace 2376 days ago
Although efforts like the Internet Archive are noble (and I find it occasionally useful), I'm not sure it's always so great that everything anyone does online will be permanently archived.

I know many people feel that everything should be available forever. But for me... it's pushing me away from doing much on the web. I liked it in the 90s when things were more ephemeral. When you could make mistakes and not have them easily found by anyone with a few clicks, forever.

17 comments

You're right, let's burn the library down because one book has a libelous chapter in it.

This argument is so horrible as to be actively harmful to Archive's work. Jason Scott is a god, and if we didn't have him, we'd have to invent him.

WE DO NOT GET TO CHOOSE WHAT THE FUTURE FINDS INTERESTING.

We live in the only point in human history where we can actually save all of humanity's knowledge and culture, and we can do so without having to worry about physical space or staff to work the "library." It's a remarkable time we live in, and yet, 99% of our society either doesn't care, thinks this work is stupid, or actively works against it through horrific copyright laws.

We know more about how Rembrandt painted and lived than we do about how Atari 2600 programmers worked and lived. I can go to Rembrandt's house and see where he lived, where he painted, how he worked, where he slept and ate and mixed his paints and taught his classes.

Atari's old HQ is just another office building. The source code to those games is mostly gone (thankfully, it's assembly and easier to disassemble). We need to save our culture and digital heritage, else we forget where we come from.

Deleting some old tweets is one thing, but actively worrying about Archive's work is just harmful to us all. We need 10,000 more Archives, dammit. It's supremely important work that is helping stem the tide of lost culture due to stock market forces. Geocities is gone forever because Yahoo! didn't find it profitable. This cannot keep happening.

I’m not convinced it’s dangerous to explore whether there are benefits to ephemerality.

I’m also not sure your Rembrandt example shows what you suggest it does. The average Atari 2600 programmer would be more equivalent to the hundreds of now unknown artists in Rembrandt’s time. The John Carmacks of today will be remembered in detail with or without blanket archive efforts.

Maybe, just maybe, Rembrandt’s status in our minds is a result of generations of people each seeing the individual value in his work. That is, each generation does indeed get to decide what future generations remember. Or at least it used to be true until the digital age.

Maybe the change is an improvement. But maybe not.

And libraries are the epitome of what you’re fighting against. They are by definition works chosen by humans based on judgment calls of their perceived value.

Let’s at least acknowledge that blanket archive efforts are a fundamental change in themselves and a departure from the human status quo for thousands of years. Then let’s debate whether the change is an unalloyed good.

While I don't endorse your parent's over-the-top rhetoric, and I do agree that there is value in ephemerality and that it's worth noting that libraries are more carefully curated than a dump-and-archive, I think it's also worth noting that these are generally public pages.

All the stuff Tumblr users intentionally wrote and published publicly, but none of their IP address logs and other incidentally collected information, is exactly what ought to be archived and preserved, in my opinion. This is in strong contrast to incidentally collected data including clear PII like IP addresses that many companies today are hoarding forever, when they ought to be ephemeral.

Tumblr blogs often include people's names, faces, and/or details about their personal lives. That's very much personally identifiable information! And while they did post it publicly, they likely didn't do so with the intention of it being saved forever in a publicly available, easily searchable archive. This especially applies for porn blogs where people post their own original content.

There's certainly value in archiving social media but I think it has to be balanced against the harms, instead of defending the practice with literal religious fervor and dismissing all criticism out of hand.

Was there something in particular I said that you felt was defending it with "literal religious fervor and dismissing all criticism out of hand", or were you referring to my grandparent? I don't think I dismissed anything out of hand; I specifically acknowledged both the value of ephemerality and the point that traditional libraries are curated.

I agree that there is a danger that people may not realize how public and permanent the things they published to Tumblr were, or how dangerous it can be to do so (and I downvoted a sibling comment dismissing this danger). However, I think you and I have different threat models.

In my mind, archiving PII that is intentionally published is not particularly harmful because most lay people do, in fact, understand that their avatar, username, and by default, posts are public on Tumblr. They have had the opportunity to remove that information this whole time, and they still do: Archive.org removes stuff if you ask them.

By contrast, lay people have no mental model for what kind of information is incidentally collected nor how dangerous or benign it is. Certainly, lay people also can and do misjudge how public and how dangerous the things they intentionally publish are, but the gap is far, far less than incidental information. "Would you tell a stranger this" or "would you write this on a bathroom wall" are decent heuristics: the only difference in danger between text written on a bathroom wall and written on Tumblr is due solely to the potentially wider reach and possibility of even going viral on Tumblr. (Photos, of course, can also subtly compromise privacy in ways surprising to a lay person, but the gap is still much smaller than incidental information.)

In my threat model, that gap in understanding is much, much more dangerous than the intrinsic danger of PII. That's why I think that, as long as Archive.org has a usable removal process, pretty much all the danger is in surveillance capitalism's collection of incidental information, not Archive.org's permanent record of intentionally publicized information.

The reason we fight against censorship (which is what this debate comes down to) with literal religious fervor is because that's how the other side fights for it.

Don't want it archived forever? Don't put it on the Internet. Seems simple enough.

If Archive.org had your attitude, I would actively oppose it. Removing private, personal info is not censorship. And nothing about "just don't put it on the Internet" is simple. What if someone hacked your devices and then put it on the Internet for lolz? What if you shared it in confidence with someone you trusted, who is intentionally putting it on the Internet to hurt you? What if you accidentally pasted the wrong thing or uploaded the wrong file? What if you were a child and didn't understand the dangers?

There obviously should be ways to ameliorate your mistake, which is why it is absolutely critical that Archive.org has a removal process.

Many people writing personal diaries/letters probably didn't do so with the intention of it being saved forever in a publicly available, easily searchable archive.

Yet such data is invaluable to historians and can give us a window in time through the eyes of people who lived that time. Having that publicly available data lost for all time would be an immense loss to future generations.

I'm sure in a few generations, some historians will study those archived porn blogs and get an insight on the evolution of humans' relations to sexuality that today's historians can only dream of.

Ironically, IP addresses are probably the _least_ personally identifiable bit of information in a lot of that stuff. Most people's IPs are assigned to someone else within months, or even hours. But a username, profile picture, etc? Those are potentially identifiable.

In a reply to your sibling I explain how in my view, the fact that lay people have no mental model of what kind of information can be incidentally collected and how dangerous it is, whereas lay people are much more capable of understanding the dangers of a personally identifiable username, profile pic, and personal details revealed in posts, makes the former far more dangerous than the latter.

> The John Carmacks of today will be remembered in detail with or without blanket archive efforts.

Sure, but this leaves us a distorted view of history, where we have lots of details on the lives of "great men" and next to none on how ordinary people lived. Which means the vast majority of people who lived and died in that period end up written out of their own history.

Archaeologists spend a lot of time rooting around in ancient rubbish piles and cesspools, because these are some of the very few places where physical evidence of how ordinary people lived has survived. Nobody in ancient times would have nominated those sites as culturally important or worthy of preservation. But what we know of how ordinary people in those times worked, played, ate and drank comes largely from things dug up from them.

I certainly am sympathetic to the preservationist mindset. OTOH, even if we restrict ourselves to content that is natively created in digital form, the amount of "stuff" that comes into existence every day--much of it not on the public web or public social media--is staggering. (And much is not public for good reasons.)

I'm not convinced that we should feel a compulsion to save all of that. Just because it's more practical to be a pack rat about digital content doesn't mean that, taken to extremes, it doesn't still seem like being a pack rat.

In 2008 I found a parcel of bare EPROMs at a flea market containing 27 games. One of those games was Cabbage Patch Kids Adventures in the Park, and it was spread across 12 chips, each one showing a progressive state of development across 9 months.

To my knowledge, this was the only known find of a vintage Atari 2600 game and its iterative development process. So, 30 years later, the only reason we had this snapshot is because someone found these chips and sold them at the flea market.

The current state of digital preservation is abhorrent. Those roms would have taken up less than 1/4 of a 5.25" floppy, but the company behind them never thought to preserve that information or data.

Take2 Interactive republished BioShock in 2012. They couldn't find their source code. They didn't save it. They had to go machine to machine looking for it. The reissued game is not the same as the original.

As a society, we don't place any value on this stuff, but the potential value of it cannot be understood until the future has occurred. Letting it vanish is a disservice to the future. In the past, if a book was published, it wasn't going to vanish if the publisher went out of business; there would simply be no new copies.

In our digital online age, things vanish in seconds, hours, and days. This is also a very different state of affairs. In the past we could not save everything, but everything didn't have a clock counting down from the end of the quarter over its head, counting the seconds until it is deleted.

The Library of Congress tries to save everything. Yes, libraries weed the stacks and choose items to host. This is due to space concerns: they can't host everything ever. Digitally, they can, and many host reams of microfilm and old newspapers because they can.

Libraries can, thanks to tech, now host every book ever, digitally, for very low costs. Copyright prevents that.

This is an unalloyed good. Leaving things behind and forgetting them is how you get Tulsa, Oklahoma, or the Armenian Genocide denials. We don't get to choose what the future finds interesting, and for the first time in history, we do not have to. Why in the ever-loving fuck would you worry about that?

Most likely, only for personal reasons. This is a humanity level problem. Your personal worries are irrelevant in 100 years when everyone who ever knew you is dead anyway. Geocities would be more interesting at that time, as a subject of study.

Library of Congress, British Library, Bibliothèque Nationale etc. choose to save everything they are mandated to, and a fair bit extra besides. That includes everything published. They don't save their water cooler chats, personal letters and everything sent by post, everything said on the phone or Facebook, etc.

The bar - perhaps found accidentally - seems quite important in deciding what must be archived, and what probably shouldn't.

Archives of personal letters and ephemera, preserved in manuscript/special collections libraries, are incredibly important research sources. This often includes letters which were never meant to be published. LOC had a project to preserve every tweet (published to the world) until a few years ago - who knows what tweets might be useful to future researchers?

And yet, hundreds of years later historians and linguists crave letters, and post, and telegrams to get a glimpse of actual life outside official publications.

Sure, and a hundred or more years later the family of the author, or relatives of the recipient can decide to release the family letters or telegram from WW1 or the US Civil War etc. That delay, usually at least until the correspondents have died, is important. The affair, the less than ideal belief, and all that other imperfect demonstration of humanity can no longer hurt or embarrass. It ceases to be private and personal and moves into the historic.

Releasing them whilst the (probably famous) sender is alive is most often done to cause damage, or is simply tasteless, or amounts to paid-for revelations in the gutter press.

Are these EPROMs archived or available to play anywhere? You've got me curious!

(amateur digital archivist and data recovery hw/sw dev here, I find this stuff fascinating!)

> Leaving things behind and forgetting them is how you get Tulsa, Oklahoma, or the Armenian Genocide denials. We don't get to choose what the future finds interesting, and for the first time in history, we do not have to.

There is plenty of evidence for the Armenian genocide, the Holocaust, and 9/11. That doesn’t really stop deniers or conspiracy theorists. When it becomes politically advantageous, spreading misinformation becomes weaponized and mainstream. A bunch of nerds saving some ROM dumps isn’t going to really change that.

Considering the Library of Alexandria, it’s also quite idealistic to think archive.org will be around in 100 years or more. Not that we shouldn’t do it... but the future can be unkind to even the most modern technology.

> We live in the only point in human history where we can actually save all of humanity's knowledge and culture,

Because of the welter of proprietary, undocumented formats, media bitrot and the like, we are actually moving away from such a point.

Turns out historians may not be so upset. You can be a historian of early medieval France and have a chance of reading 100% of the surviving documentation. Too much data can obscure the story.

Of course historically you could get a PhD for compiling a concordance to Shakespeare, something that can now be done mechanically in seconds. Future historians could (and will) apply the same tools to today's surviving documentation. But I don't believe there'll be as much of it as you seem to think.

>Because of the welter of proprietary, undocumented formats, media bitrot and the like, we are actually moving away from such a point.

The best we can probably say is that it's different. We're capable of saving far more but, in practice, a lot of digital media is locked up in walled gardens and accounts that have to be paid for and require logins.

It's presumably easier to save a bunch of photographs or videos in a way that they'll be accessible so long as key Internet sites or their successors are. A fire or flood probably won't destroy them. OTOH, unless you've taken affirmative steps to upload that media to the right place, it won't be serendipitously discovered in a shoebox some day in the future.

> Turns out historians may not be so upset. You can be a historian of early medieval France and have a chance of reading 100% of the surviving documentation. Too much data can obscure the story.

I don't understand this reasoning. Yes, more data = more work, but less data = more likely you're wrong.

Less data: less likelihood someone will challenge you before you get tenure.

The incentives in academia are messed up.

Is your entire original comment meant to be read as sarcasm then?

Or are you advocating for the idea that history is arbitrary and it’s better to just have a simple story than to have to worry about what really happened?

I’m having a hard time understanding what idea you’re trying to position in this debate.

I was making three points in the three paragraphs:

1 - it's more likely we will be an information-sparse region in the historical record rather than an information-dense region.

2 - professional historians have their own set of incentives which can be counterintuitive to the layperson.

3 - but indeed if there turns out to be a huge amount of stuff (there will likely be mountains of some forms of ephemera) to go through some people may be able to find value using new tools not available in the past to historians.

As someone trained as (but never worked as) a historian I do indeed have a bit of cynicism on point 2. I suspect most if not all actually working in that domain have the same cynicism.

"All available documentation about medieval France" somehow ended up being "oh, we don't know exactly, a lot is guesswork".

And that's France. What about Medieval Africa? The Americas?

Oh, we mostly burned that documentation because "it wasn't important". Right?

My understanding is that the Jesuits burned the Mesoamerican literature because they thought it was dangerous to their reign, not harmless. An appalling crime.

> You're right, let's burn the library down because one book has a libelous chapter in it.

Jeez... You may want to re-read my comment. I have written no such thing, and it is not my opinion at all.

This comment would be better without the first three paragraphs, which actively detract from what you're trying to say.

> You're right, let's burn the library down because one book has a libelous chapter in it.

I feel you got the comment backwards: a better analogy would be "if a used-books store full of Dan Browns were to burn down, would we regret the loss of maybe one chapter that has some value?"

Your position seems to be "yes", but I wouldn't dismiss so easily the opposite view: that 90% of everything is crap, and that keeping everything forever "just in case" sounds surprisingly similar to hoarding.

I do not oppose "purposeful archiving" - as someone mentioned, saving outgoing Wikipedia links seems smart. But my old twitter account, where I kept track of missed trains? There are better sources for that, and no one missed it when it was gone.

It's almost impossible to evaluate what is of lasting value in the moment, while it is readily available.

Imagine an author writes a paperback. It isn't very good, but a few people read it. Later, one of those people goes on to rework some of those ideas into their own script for a film. The film is a success. Years later, the scriptwriter mentions the paperback as an inspiration while giving an interview, but it's long out of print.

To a biographer or a devoted fan of the film, this forgotten book, while of little value in and of itself, has become a valuable part of a larger story. If it were culled when the contents of that used book store burned down, we would have lost something without realizing it. And that's how we lose most things. The only way to minimize this is to store as much as we can, in the hopes that we may find a use someday, and thankfully digital storage has made this very, very cheap. The opportunity cost is tiny, and the potential reward, given enough time, is unbounded.

But the opportunity cost is not tiny. This is literally a twitter thread asking for financial support.

And I do recognize that the thousands of petabytes will likely be chump change to store in a decade... but the rate of content production will necessarily keep pace with the economics of storage. It will always be expensive to store everything.

The question is, do we gain back this investment from future uses of these archives? I dunno. I’d be interested to hear what value archivists have gotten out of the archives, given it is decently old already.

Like with other "90% of X is wasted" sayings, you don't know which 90% it is.

Even if you look at classical art with an honest eye, you can find plenty of works that in themselves are, well, crap - but they're being preserved and reproduced and talked about, because they acquired meaning over time. They've become relevant in context.

Take your old Twitter account. It's probably not interesting. It probably won't ever be. But it might. Imagine several decades from now, your great-granddaughter becomes a well-known, influential politician. This might retroactively and posthumously make you relevant, and in the process your Twitter account. Biographers might find it useful. Or independently, people who're into historical train schedules. Etc.

It's near-impossible to predict what the future will find relevant, so if storing some memories is nearly free on the margin - as it is today, with digital technologies - then just storing it is a no-brainer.

I think a better analogy than a library would be your average day in the office: would you want everything you say and do in the office recorded for eternity? Sure it would help, say, catch fraudsters, track responsibility and credit, allow sociologists fascinating analysis - but is that worth it? The >1GB of Google+ is a good example. Probably many interesting posts from people that are the core experts on topic X - and many nonsensical Twitter-like posts of people sharing whatever they encountered or thought that day.

>> We live in the only point in human history where we can actually save all of humanity's knowledge and culture

Playing devil's advocate here for a moment. . .

Considering we as humans learn little from our past, keeping all of this knowledge is a benefit to whom then? Some people who feel nostalgic about Sony's first Walkman? Or maybe people using it for nefarious reasons? If humans continue to make the same historical mistakes over and over, what benefit does the human race gain from cataloging all this information? I would venture to guess it's more plausible it will be used against us instead of furthering our own culture.

>> We know more about how Rembrandt painted and lived than we do about how Atari 2600 programmers worked and lived

There is a huge difference between saving all of Rembrandt's stuff and saving that of some 22-year-old college-dropout programmer who created a video game in the heyday of a long-forgotten startup company. And yeah, there have been numerous documentaries and articles written about Atari in those early days. Who would want to save a dilapidated roller rink on the grounds that a great and noble video game company used it as their HQ for a few years??

https://www.polygon.com/2018/7/6/17542154/atari-book-valley-...

But then this roller rink down the block became available: 10,000 square feet! I mean, we were just jam-packed, and we had people on roller skates actually running around on the roller-skate rink building Pongs.

While I do think leaving certain things to the sands of time is a good thing, vacuuming up everything is just as worrisome. Are we going to be hoarders of a bygone technological past where a large majority of the "stuff" we save will have little, if any, use to anybody anymore??

Having a background in anthropology, I find it fascinating there will be many generations of kids who leave no physical trace of their existence since a large majority will be in electronic form. Just imagine how people's lives are in a sort of suspended animation after passing away and having their Facebook pages live on forever.

It's a bit hard to wrap your head around tbh.

I would say there are several classifications of things worth saving through a broad net:

- kindling sources, like a LiveJournal post that inspired Lin-Manuel Miranda to write Hamilton (for a fake example)

- early work of a future star, like imagine Lorde posted early songs to MySpace. This is already a clear issue as many posted songs have been deleted or lost for various reasons.

- valuable things on shaky ground. Yahoo Groups, for the latest example. But I just saw on Reddit someone was looking for a deleted scene from Blair Witch Project that was supposedly the first video ever published on Amazon Prime Video... and now it has nearly vanished. That seems crazy to me from so many directions.

- the value of the ephemeral. Gold and jewelry from old civilizations is nice but we know so much of how people actually lived by examining their garbage, scrap notes, broken bowls, etc.

- the myth of permanence. We feel that if 10 million people see a video, it is probably preserved. But there are no master tapes of any of this and so much of everything is interlinked and hard to piece together after the fact. What was people's tech stack when they were making MySpace? How big were people's hard drives? Did they share songs through Kazaa or play them on MySpace directly? What was the state of JavaScript then, what were the security issues or underground trends? How did songs propagate, where were they shared? Were people sending links in email or AIM, were people sharing links on Digg? This is stuff from like a decade or two ago and already you need to think like an archaeologist to have any sense of how the culture really existed because there were so many moving parts from year to year.

- the value of datasets. Imagine putting some thought against the Geocities archive to see how HTML blink tags grew then fell in popularity over time. Or how a meme propagated. Or analyze the link structure between groups of people or by topic, or make any number of interesting inquiries about how humans operate culturally in digital space and how they interact socially through a certain set of tools and limitations. There are very interesting possibilities here for understanding ourselves better as a species.

>You're right, let's burn the library down because one book has a libelous chapter in it.

It is more like, either you burn the library down or everything you have written in your private journal is now available to be checked out by anyone.

It really shouldn't be that way and I think we should fix the problem of holding people responsible for bad behavior in the past. But how do we draw lines (for example, what about holding people responsible for past crimes)?

>We need to save our culture and digital heritage, else we forget where we come from.

I agree, but we also need to ensure this is done without costing individuals. Technology has advanced, but society has not. Our technology outpacing our culture has hurt and will continue to hurt many people, and we should try to find a way to fix it.

It is appalling to me that the parent comment is being downvoted. Religious fervor indeed. The point being made is simply that saving every scrap of history including personal tracking and details that are normally LOST to history, is a sea-change in human history and shouldn't be looked at lightly.

There is value in forgetting. What we forget, then, is a very relevant question. "NEVER FORGET ANYTHING RAWR!" is not a useful point of view because it denies the very right to have a conversation on the subject.

If you can't agree that I should have some say in what is remembered (or at least archived) about me or generated by me, there's not much we can talk about.

It is also a matter of consent.

And to make it even clearer, one just needs to think about leaked images. Should we not allow a person to delete such images leaked without their consent?

1,000 years from now archaeologists may have some academic interest, and by then those involved and even generations of their descendants will be long dead.

But what about 1 year from now? Benefiting those 1000 years from now at the cost of those alive today is a hard position to justify, especially with such a blanket justification.

The Circle (both book and film) was, sadly, a largely botched attempt to explore issues like privacy, the power of tech companies, widespread surveillance, etc.

Missed opportunity in that the book was only readable if you took it as a deliberately over the top "if this goes on" fable. And the film was mostly notable for how it squandered a top-notch cast.

> Atari's old HQ is just another office building. The source code to those games is mostly gone (thankfully, it's assembly and easier to disassemble). We need to save our culture and digital heritage, else we forget where we come from.

Very good point!!!

In my view the Internet Archive should be the digital equivalent of the National Register of Historic Places (NRHP). Shepherds of documentation, to give it a cool-ish sounding name.

My personal, obscure ISP user page (think the ~user/ era) from 1995 is preserved in all its drop shadow blink tag marquee glory at archive.org with me doing nothing, it was just captured by whatever natural processes. The things I said on mailing lists, random forum posts etc. - it's all archived. That 90s stuff isn't/wasn't as ephemeral as folks think in my opinion, it's out there somewhere. $0.02 :)

> I'm not sure it's always so great that everything anyone does online will be permanently archived.

But you see, even if the Internet Archive didn't exist, someone would probably still be saving a copy of the things you do. It'd just be a megacorp or surveillance agency instead of a more egalitarian organization.

So the choice isn't "things on the internet are ephemeral" or "things on the internet are available forever to everyone", it's that or "things on the internet are available forever to some subset of the rich and powerful".

Maybe if everything were archived forever, we could understand that everybody makes mistakes, and stop paying so much attention to old posts? Though I admit this is a very optimistic view of human behavior.

Ancient posts rarely do get any attention though unless you are a politician, and even then most people agree it's worthless information. There was the recent event with the guy from Canada having photos of him wearing blackface almost 20 years ago and most people agreed that something so long ago is totally irrelevant to today.

Some mistakes are also worth recording. I like seeing bad predictions of the 2000s from the 1970s for example.

Not to mention that quite a lot of what is archived today has been made by companies, and there's no "right to be forgotten" that companies could ever deserve. For example I've uncovered quite a few mistakes in currently public datasets/websites based on archived sites; who knows how many mistakes are made now and never fixed because we lose the original sources. Point being that the lack of an original source doesn't mean the information gets lost, it just becomes a big version of the kids' game "telephone" where everyone recites what they heard and it gets distorted in the end.

I'm sure Ea-nasir would agree with you, but I doubt Nanni would. (https://en.wikipedia.org/wiki/Complaint_tablet_to_Ea-nasir)

I think it’s fantastic that almost 4000 years later, people still learn that Ea-nasir was a cheating bastard.

>I'm not sure it's always so great that everything anyone does online will be permanently archived

The real problem here is the runaway cancel culture, where we attack people for things they said or did years or decades ago which were (at the time) perfectly acceptable and reasonable.

The most egregious example I have seen so far is cancel culture advocates who think we should disregard the late Richard Feynman’s legacy because he said some rude things to a lady back in 1946, even though the lady herself was not offended, since she did sleep with him later that same evening.

There’s a point where we just have to say “That was a long time ago, no one at the time was offended, get over it.”

Indeed. Context matters, and societies evolve over time. Opinions which we'd consider abhorrent today may, once upon a time, have been acceptable.

A comment made years ago is only a reflection of a person's opinions at that point in time; opinions which may have changed since.

The consolidation and permanence of the web are definitely concerning.

Moving from "somebody knows this happened" or "this is in a file drawer somewhere" to "there's a searchable record of this" expands everyone's access to the info, and can do a lot to stave off forgetfulness and bit rot. But the people who gain the most access are the ones who weren't involved in the first place, and the intersection of "uninvolved" and "cares enough to check" tends to be people who are actively hostile. Hence doxxing, stolen photos, and callouts over years-old tweets.

But that's a broad result of digitization. If a reporter or opposition researcher wants to embarrass someone, they can already look through digitized student newspaper essays, find interview subjects off class rolls, or simply comb through Twitter for long-forgotten offenses. (This holds for both good and ill - it applies to both serious skeletons and misleading or trivial issues.)

The Internet Archive, then, seems like sousveillance offsetting surveillance. For those who can point time, money, and connections at a target, it's enough that evidence exists, and more than enough that it's available online. But for the general public, it's much harder to keep track of countless sources or publicize news. If you can't dedicate interns and an archive to tracking every news story you read, you can't find or prove edits. (And while most newspapers noted corrections or morning/evening revisions, silently changing online stories has become common practice even for the likes of the BBC.) If you can't point out a webpage or tweet to thousands of people at once, the evidence is likely to be taken down before it's recognized. There are a lot of dedicated sites like NewsDiffs working on this problem, but Internet Archive provides a general-purpose answer to "let an average person see the history of a page or create a trusted record of it".

I worry that this just amounts to an eye for an eye, and still increases the total amount of scrutiny we're all under. But as long as more content is becoming permanent, it still seems better to have symmetrical access to it.

It's not actually true that everything anyone does online will be permanently archived. If it were, there would be no need for the Internet Archive.

The truth is, only the things someone has an interest in archiving will be archived, and only so long as someone has an interest in maintaining those archives. Just look at the recent announcement about Yahoo Groups... no one was, and likely no one is, going to permanently archive most of that. Sites, content and history get lost all the time.

Here's a great talk about the real ramifications of this sort of problem:

https://tararobertson.ca/2016/lita-keynote/

I think it would be reasonable to establish a bar, similar to offline, where everything above it is archived and everything below is opt-in.

In the offline world the National Libraries get a copy of every book, magazine and newspaper published, by law. At least that's the way the UK and US do it. They archive a lot of other stuff as well, including music, audio, adverts, but that's more informal, and there is no requirement to preserve.

Personally I'd like the things politicians and personalities (by dint of having chosen to live large) say online archived, and all business (to later hold them to account), along with the sites of anyone in the business of influence - think tanks, parties, lobbyists, activists, "grass roots" organisations etc. Individuals, anon forums, HN and reddit subs and other places of shooting the breeze should be allowed to stay ephemeral. In fact I think conversation is freer that way - some will choose to say less, say different, or say nothing if everything everyone says is forever...

In the US, mandatory deposit technically applies to any copyrighted work of any kind. We should fund LoC to enforce mandatory deposit on digitally published works as well.

In a sense this is also a good demarcation point. If something is serious enough to be worthy of copyright protection, it's probably worth archiving.

Funny you mention HN and Reddit as ephemeral because I always thought they were more permanent than most. While you can email the mods and ask that a particular post of yours be removed, I don’t think they will wholesale scrub your content out of the archives if you request, or help to anonymize them in any way.

I consider HN pretty much permanent and tread carefully with controversial opinions or things that might one day be considered not-PC.

What law in the US? In ancient times, like 40+ years ago, before the first big extension and before copyright was automatically granted at the moment of creation, you were required to send a copy to the LoC to earn the right to enforce copyright.

I've published several books the LoC does not have.

Also, the national libraries aren't the sole archives of culture. University and private libraries preserve all the important stuff the government has neither the interest nor the budget for.

To my surprise, I learned a few months back that mandatory deposit [1] is still actually a thing, albeit a completely unenforced thing. (I believe deposit is needed if you want to sue for damages, but in theory you're supposed to deposit in any case.)

[1] https://www.copyright.gov/help/faq/mandatory_deposit.html

Not only that, but I also wonder if we're overestimating the value of keeping all of this data around. Who's going to have the time to search and curate these mountains of information when we're generating tons more of it every day? I imagine the ideal goal is to allow future historians to learn about our past selves, but I think there's a tipping point where only those with lots of resources can afford to meaningfully consume it. Those typically are wealthy companies or individuals, and I'm generally less excited about what they do with our information.

Obviously there's value in archiving some information, but a save-all or even save-most approach starts sounding a little hoarder-ish. Sure, you might one day make use of that November 1997 TV guide, but chances are you won't, and in the meantime you're paying the opportunity cost of storing it.

Maybe we need to take a page from Marie Kondo and only keep that which sparks joy and learn to let go of the rest. There's a chance someone will need a bit of info that no longer exists, but we'll probably be ok.

Part of the challenge here is that it's hard to know in advance what is or isn't worth archiving. It may only be clear a few years later that some big chunk of now-dead data was important.

In that sense, curating all of it doesn't really matter as long as you archived it. Someone trying to find the data later (or curate it!) can find their way to the right URLs using other sources, and then begin the process of curating this archived data after-the-fact.

The Internet Archive is most useful when you click a link and it is dead, which happens very often. Wikipedia's references are filled with dead links which now point to IA.

There is probably a lot of junk data on IA, though, especially video site archives, but it's worth keeping stuff that isn't needed if it means keeping stuff that was useful.

> Who's going to have the time to search and curate these mountains of information when we're generating tons more of it every day?

Presumably some sort of search engine, not a person.

Well, there are some tools that have been developed that have pretty amazing capabilities to crunch through staggering quantities of data and come up with useful insights. It's basically big g's core competency, and there are tons of other companies that do the same thing, as well as open-source solutions that can be used.
In the not-so-distant future many people will record their entire lives: movements, utterances, biometrics, audiovisual and sensory data. Then they are going to freak out when dead people's lives start getting deleted because nobody is going to pay to host all this crap.
I was having the same thoughts. Most of what I've used Archive for is to look up e.g. old blog posts for personalities that show their hypocrisy compared to today, for example. Or someone posted something daft when they were 15 and their handle was leaked and now it's out there forever, and we can laugh at how stupid they were.

I'm sure glad I went to efforts to scrub my personal sites I made when I was a teenager!

I don't think all blogs and personal content (for lack of a better word) should just be archived. You should need consent. Most people have no idea it's going on. Or it should be very easy to delete something from the archive.

I'm convinced that this instinct to preserve everything forever is psychologically connected with the denial of mortality. (Edit: I'm not saying this is a bad thing, just suggesting the phenomena may be connected.)
I think it’s actually simply evolution at work. That’s part of our evolutionary process as humans, building on historical achievements of our ancestors.
Indeed, since the psychological phenomenon of the denial of mortality is the result of evolution at work.
You might be on the right track. We're talking about an organization that keeps a church room full of ceramic sculptures of its employees.

https://www.businessinsider.com/the-internet-archives-100-ce...

It's good to note that there's a difference between archive and (big A) Archive, as a practice and discipline. As far as I can tell, Archivists (like, people who went to school to be an Archivist), don't really agree with Jason Scott's agenda and approach.
On one hand, sure, library science and forensic analysis are extremely important, and nothing lasts forever, especially without the care of curators. We aren't dismissing traditional or classical archival methods, and they have already taught us much about how to do digital archiving. [0]

On the other hand, clearly the Internet Archive is a competent digital archiver, and they've earned the capital-A "Archive". They publish a larger digital commons than anybody else, I think, especially at the low low price of gratis.

It sounds like your entire complaint is in two points. First, that IA doesn't ask (much) permission, which is unsurprising. The history of libraries is not one of asking permission, but of simply doing it. The public has been convinced repeatedly, over the decades, that libraries are good for them, and this public support helps insulate librarians from corporate interests.

Second, that IA doesn't employ enough women. I can't help you with that, but you are free to improve yourself.

[0] https://en.wikipedia.org/wiki/Disc_rot

Why is going to school for a subject a proxy for competency? Jason works for the Internet Archive. Perhaps "archivists" might consider more real world experience versus academic exercises?
Archives aren't new. It's not like software engineering, where you can cowboy shit and blaze new trails without giving thought to what others in the past have done.

Not to mention that Archives is historically a female-dominated industry. This is a real world example of a loud, boisterous man "disrupting" an industry.

It is entirely the wild west, where you can "cowboy shit and blaze new trails without giving thought to what others in the past have done" [1]. Anyone can be a digital archivist, anyone can run an archive (object store, metadata management, distribution). If I had to compare it to another industry, it'd be newspapers. Barrier to entry is low now (command line, compute, storage), and anyone can do it. This will continue as storage continues to decline in cost, tools get better (disclaimer: I maintain some tooling in this regard), and software improves for capturing physical materials as digital representations.

If someone thinks they can do better, they are free to try. No one is gatekeeping their attempt. Help yourself to some storage & VMs and write some code. If you do better (regardless of gender), everyone benefits.

[1] It's not bad to be able to wild west it and cowboy shit in non-regulated industries, where someone's life safety, finances, etc aren't at stake. Two cents.

Yee-haw.
Do you know more about what they think?
Here are some snippets from Archivists that are actively talking about this:

>1. Archiving isn't just capturing data or downloading it, it's making it available into the future. Without an intense amount of planning around that, the act of capturing is pointless.

>2. We don't need all of Yahoo Groups. We need a subset of Yahoo Groups. Choosing what, exactly, is worth keeping is called appraisal in the archives world - not like monetary appraisal but cultural appraisal.

>3. We also don't need all of Yahoo Groups because hoarding data long-term is terrible for the environment. Digital preservation is also terrible for the environment. So we should be extremely judicious about what digital content we choose to attempt to retain permanently.

>4. It's an incredible violation of privacy and doesn't align with the ethics of the archives profession to collect all that data without permission from the people involved, especially in the case of private groups. People should also have the right to be forgotten.

>It’s also part of a long-term pattern of IA (and this dude in particular) deciding to “archive” things that people have not given consent for—and, in some cases, have explicitly asked not to be preserved. There’s another gendered element here: IA tends to get tons of accolades and funding, and is largely seen as a group of do-gooder dudes just trying to preserve the internet. Meanwhile they routinely ignore or denigrate the work of librarians and archivists trained in digital preservation—professions that are overwhelmingly gendered as female.

>Plus Jason Scott is generally a dick to anyone who brings up any ethical qualms about their work (and he has a stupid hat in his avatar.)

I'm inclined to agree with this position. Does every YouTube, Reddit, Twitter, Hacker News, Facebook comment need to be archived and stored for the next thousand years?

I'd argue no, and there's a huge amount of waste in there - so many bot posts, or just spam.

But here we are, people want to archive every byte of information that traverses through the internet.

These days you have to approach cautiously, as everything you do or post may be archived.

There are huge privacy concerns in the present day, but there's the other side of it too: today we consider it a treasure when we find "Maximus sucks a big dong" type graffiti on some wall in Pompeii. The glimpse into the life of the Romans is itself exhilarating. A thousand years hence you won't be around to care about how your embarrassing posts affect your reputation, and those who find it might be more grateful for the glimpse into early 21st century life than inclined to snigger or cringe at your comment.
> today we consider it a treasure when we find "Maximus sucks a big dong" type graffiti on some wall in Pompeii.

I thought the rareness of such finds is what makes it a treasure, not the content itself.

Exactly... if finding that stuff were extremely common the place would be nothing more than a running joke.
Speak for yourself ;)
> These days you have to approach cautiously, as every thing you do or post may be archived.

That was obvious in the 90s.

But sadly enough, it hasn't worked out that way. Before the Internet Archive anyway.

I don't really worry too much. Between all the stuff that doesn't get archived, the fact that the sheer volume tends to effectively hide any single piece of information, and the frequent difficulty of connecting online identities to specific real people, it's not like most people have a perfectly discoverable complete digital record online.
Which is fine as long as you remain anonymous and unimportant in the present day. It becomes much more of an issue if you abruptly become a person of interest, and all of a sudden there's a team of people motivated to go through that archive with a comb looking for the few juicy tidbits needed to publicly humiliate you.

And obviously this is already happening: in the Canadian election a few months ago, there was a big scandal where some pictures turned up of the prime minister wearing blackface at a holiday party he attended in 2001. And on top of that there was a pretty steady stream of candidates (some of whom were indeed booted from their parties) who were challenged over social media comments/posts made 5+ years earlier, especially on hot social topics where national sentiment has rapidly evolved in recent times.

Perhaps we will all just become inured to this, and rightfully be able to accept our leaders as human, judging them for who they are today and not past words and deeds. But is it possible to still retain the ability to fairly judge someone in the present while forgiving the past? Or will we lose the ability to discern the difference? The GOP's attitude toward President Trump does not give me hope on this.

True enough. I'm probably pretty happy that nothing I wrote or photographed got online anywhere without going through an editor until well after college. And in pre-digital/smartphone days a heck of a lot less was recorded for posterity than today. Not a lot of pictures from parties etc. even in my archives (and I did a lot of photography as an undergrad).

And, although I assume I'm in various Usenet archives, my participation there was always from a work address and was pretty tame. (BBSs mostly were too although all that content is gone anyway AFAIK.)

I know there are already firms that do online history scrubbing of people who get some warning that they're about to be entering the public eye, but that kind of thing is probably going to be a major growth area. Even if there's content out there that's beyond one's ability to clean up, just being forewarned/reminded of its existence could be an advantage.

I could definitely picture political parties and businesses being willing to pay $$$ for an internet presence dossier as part of their larger candidate vetting process.

You never know.

For example, about Sabu etc: https://sites.google.com/site/avalonlogsefnet/

    Oct 29 12:28:53 <Cambion>    we have logs that date back to 1993 and identify pretty much everyone from nickname change to nickname change
    Oct 29 12:28:59 <Cambion>    phone numbers, everything
They don't archive social media content so this whole point is moot.