Hacker News
by CarelessExpert 1703 days ago
> Because they are so JS-heavy, and reliant on CI/CD pipelines for deployment, on custom CMSes, there is no way to archive them in the way that static pages containing just text and images can be archived on the Wayback Machine.

Welcome to the world of digital archiving. It's an enormously complicated space, and even for just my own personal projects and content, I've spent a lot of time thinking about how to ensure things are future proof and can be archived easily.

As a simple example, building my personal website atop Markdown ensures that, even if the formatting can't be preserved, the core content will be since it's simple ASCII (yes, that's ignoring issues of long-term digital storage and access and so forth, but at least it's not also a bunch of binary blobs or database formats or whatnot).

Equally alarming is the fact that so much of our digital lives isn't even in our control. A historian used to be able to rely on family archives, public libraries, etc., to understand our past. A hundred years from now they'll be looking back and hoping someone somewhere preserved the contents of an S3 bucket before Amazon decided to delete it on a whim...

Some people actually use those "takeout" features to collect archive data. So then you do get the archive, it's like having somebody's cuttings from a local newspaper rather than a complete set of local papers on microfiche.

One reason I take these is that I have RAM and I have grep and apparently either the people who had the data don't have RAM or they don't have grep, and so while I can ask my local Facebook archive "Er, didn't I write something about anti-freeze?" and get an answer in seconds, Facebook itself will try to suggest I might want pages about anti-freeze, a group that cares about anti-freeze, a sponsored advert for anti-freeze ... and not the thing I wrote.

Facebook's search and suggestion engine is hilariously broken.

Say, I am commenting in a thread trying to respond to John Smith. That's the only person whose name starts with a J.

If I start typing @J..., the suggestions would be for literally anyone else but John Smith in the thread.

On their mobile website (which lags behind the app), typing @John Smith will sometimes suggest a number of John Smiths, none of them being the one in the thread I am writing in.

Same with friends. If I want to tag a friend of mine and start typing their name, I usually get suggestions for random people first (neither from my friend list nor from the comment thread).

Why on Earth the list is not prioritized by (friends in thread) / (everyone else in thread) / (friends) / (everyone else) is absolutely beyond me.
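
For what it's worth, that ordering is only a few lines of code. Here is a minimal Python sketch, not anything Facebook actually runs: the `Candidate` type, its fields, and the sample names are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    # Hypothetical record for one person who could be tagged.
    name: str
    is_friend: bool
    in_thread: bool


def priority(c: Candidate) -> int:
    """Lower sorts first: friends in thread, others in thread, friends, everyone else."""
    if c.in_thread:
        return 0 if c.is_friend else 1
    return 2 if c.is_friend else 3


def suggest(prefix: str, candidates: list[Candidate]) -> list[str]:
    """Return names matching the typed prefix, ordered by tier, then alphabetically."""
    wanted = prefix.lower()
    matches = [c for c in candidates if c.name.lower().startswith(wanted)]
    return [c.name for c in sorted(matches, key=lambda c: (priority(c), c.name))]


candidates = [
    Candidate("Jane Stranger", is_friend=False, in_thread=False),
    Candidate("Jim Friend", is_friend=True, in_thread=False),
    Candidate("John Smith", is_friend=False, in_thread=True),
    Candidate("Julia Pal", is_friend=True, in_thread=True),
    Candidate("Bob Jones", is_friend=True, in_thread=True),
]
result = suggest("J", candidates)
print(result)
```

With this key, anyone already in the thread always outranks anyone who is not, which is the behaviour you'd expect when replying to someone.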

Once you do manage to tag @JohnSmith, he will get a notification that he has been tagged in the thread. One notification per thread, regardless of the number of individual posts he was tagged in.

The link on the notification will take him to the top of the thread.

Depending on the thread's popularity, John could have a very difficult time finding the posts he's tagged in.

These are just symptoms, though, not coding mistakes.

Facebook literally wants you to be caught up in wading through their posts, spending your life on their website.

Ha ha, perfect - you both summed up the hellhole that is Facebook commenting so well.

> Some people actually use those "takeout" features to collect archive data. So then you do get the archive, it's like having somebody's cuttings from a local newspaper rather than a complete set of local papers on microfiche.

The problem with these (at least the Facebook ones) is that the data is lacking all context. It's kinda OK if you just want copies of your photos, but I can't make heads or tails of most of my comments from the archive, and posts are missing a lot without the comments.

I have this exact same situation with my YouTube history. I used to be able to search it in Google's history search, but that only shows a subset of results.

If I want to really search for the title of a random video I saw 5 years ago the only option is to download a raw CSV of my history and use grep :(
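
If grep isn't handy, the same search is a few lines of Python. This is only a sketch: the column names (`title`, `url`, `watched_at`) and the sample rows are invented, so check the header of the real export before relying on them.

```python
import csv
import io


def search_history(csv_file, term, column="title"):
    """Yield rows whose `column` contains `term`, ignoring case."""
    needle = term.lower()
    for row in csv.DictReader(csv_file):
        if needle in (row.get(column) or "").lower():
            yield row


# Stand-in for a downloaded export. A real file would be opened with
# open(path, newline="", encoding="utf-8") instead.
sample = io.StringIO(
    "title,url,watched_at\n"
    "Sourdough basics,https://example.com/a,2016-02-01\n"
    "Cat video,https://example.com/b,2016-02-02\n"
)
hits = list(search_history(sample, "SOURDOUGH"))
print([row["title"] for row in hits])
```

Because it reads the file row by row, it copes with a history far larger than the sample here.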

Fwiw, the Internet Archive is very much trying to avoid the random S3 bucket deletion problem, and donations to them are tax deductible.

The issues of long-term digital storage are such that - use whatever you want for your own blog - but (imo) ASCII isn't going to save you any more than binary blobs are, 300 years into the future after we're all long gone and buried. We're already in a world where UTF-8 is taking over in many places. (Many places but not all. Fun fact, you can't send Zelle to someone with an emoji in their local contact name with some banks.)

If I (today) said I had a Word document and needed "an old version of Microsoft Word", I'm sure most people would know what I mean, and that I'd find someone with a Windows XP machine and a copy of Office 97. Meanwhile, there are tons of people who are just going to stare at you blankly if you tell them about EBCDIC, never mind help you find a decoder.

> If I (today) said I had a word document and needed "an old version of Microsoft Word", I'm sure most people would know what I mean, and that I'd find someone with a Windows XP machine and a copy of Office 97'. Meanwhile, there are tons of people who are just going to stare at you blankly if you tell them about EBCDIC, never mind help you find a decoder.

Funny, I suspect the precise reverse is true.

EBCDIC is a well-documented encoding. Worst case, find you a reference book and you can figure out how to deal with it, because that knowledge is open and available.
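
As a rough illustration of how little "finding a decoder" involves today: Python's standard library ships codecs for several EBCDIC code pages. The sketch below assumes the data uses code page 037 (`cp037`); real EBCDIC files may use a different page, so treat that choice as an assumption.

```python
# Five bytes of EBCDIC (code page 037 assumed) spelling a short word.
ebcdic_bytes = bytes.fromhex("C8C5D3D3D6")

# Decode with the stdlib codec; no third-party tooling needed.
text = ebcdic_bytes.decode("cp037")
print(text)

# Round trip: encoding the text again gives back the original bytes.
round_trip_ok = text.encode("cp037") == ebcdic_bytes
print(round_trip_ok)
```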

The same is true of ASCII. If you can understand binary encodings with 8-bit groupings--a fairly fundamental concept in digital computing--you can probably find your way to an ASCII table in a library somewhere.

But good luck finding a working Windows XP machine with Office '97 fifty or one hundred years from now, let alone a spec for the format.

The part about the spec of that Office 97 format is more or less taken care of by the LibreOffice project.

And once the maintained version of LibreOffice inevitably drops Office 97 support, you are back to having to find old LibreOffice versions and trying to get them to run, or porting the code.

And that's ignoring the fact that code is a terrible spec. Trying to reverse engineer a file format from a software implementation is a godawful nightmare, and I say that from personal experience.

Given the choice between that and having to figure out how 8-bit ASCII works, it's pretty clear which is the easier problem to solve.

7-bit ASCII is a subset of UTF-8, so ASCII is fine in a UTF-8 world.
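
A quick way to see this for yourself, as a small Python sketch (the sample strings are arbitrary):

```python
ascii_text = "Plain old ASCII text."

# The same characters produce identical bytes under both encodings.
as_ascii = ascii_text.encode("ascii")
as_utf8 = ascii_text.encode("utf-8")
same_bytes = as_ascii == as_utf8

# Every byte stays below 0x80, i.e. fits in 7 bits.
all_seven_bit = all(b < 0x80 for b in as_utf8)

# Bytes written as ASCII decode cleanly as UTF-8.
decoded = as_ascii.decode("utf-8")

# Only characters outside ASCII need multi-byte UTF-8 sequences.
accented_length = len("é".encode("utf-8"))

print(same_bytes, all_seven_bit, decoded == ascii_text, accented_length)
```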
> If I (today) said I had a word document and needed "an old version of Microsoft Word"

Modern Word versions still load Word 97 docs. There's a decent chance Office versions from around that time still work on Windows 10.

>As a simple example, building my personal website atop Markdown ensures that, even if the formatting can't be preserved

That's why I built my personal ADHD blog[1] on TiddlyWiki[2].

It's a self-contained HTML page that has everything.

I could have even embedded the images.

You can archive it with *File -> Save As...* (single-file .mht works).

[1] https://romankogan.net/adhd

[2] https://tiddlywiki.com

I still don't get why Firefox doesn't support MHT(ML)(=EML), while Thunderbird does, considering how that's pretty much the best digital document format we have...

They used to support MHTML prior to the Quantum update, via the Mozilla Archive Format addon or the superior UnMHT addon (which captured pages more accurately than Chromium's MHTML support did in direct comparisons I made).

Not sure why they dropped support for it entirely, given it was supported for the longest time and it's the most convenient single-file format for saving web pages. It's a major reason I couldn't continue using Firefox as my main browser.

Mozilla is too busy inserting ads and removing useful features from Firefox.

I'm just waiting until some other organization says enough is enough and forks Firefox. More likely than Mozilla pulling their head out of their ass IMO.

But it already has been forked, several times to boot?

One aspect of this is to look at the ways that history is being rewritten now from original materials. All of the -isms of the 1900s painted a picture of straight, white (male) Captains of Industry paving a way to the future, and in revisiting the source materials we are discovering that this image paved over a lot of people that were doing a lot of heavy lifting.

History is full of assistants, spinsters and confirmed bachelors whose stories are being re-told now from diaries and correspondence that have been family heirlooms for generations. You can't trust the contemporary reports as accurate, because they have a different agenda than we do 20, 40, 100 years in the future. We knew of Marie Curie within her own lifetime less because her work was so profound than because she had a husband in her own field who conspired with her to subvert a system that didn't want to give her standing. A partner outside your field can't do much for you, and a more selfish collaborator wouldn't.

Who knows what polite fictions are being told about people now that will be reframed by our grandchildren, assuming that scholars can find any of it? If I had to guess it will be neurodiversity. Probably/hopefully doing away with the Tortured Genius trope.

TBH, IMO this is all a non sequitur

My point is that the nature of digital technologies is such that information is far more ephemeral and closed off than it's ever been, not just for historians but for us, the people who are creating that information. We produce a lot more information, but control and long-term preservation are infinitely harder.

Your observations regarding the challenges historians face are absolutely true. But the effects of technology are entirely orthogonal to that problem.

After all, even if we had perfect digital preservation, what you say is still true, if only because subjugated groups are less represented in the digital discourse for many reasons, including socioeconomics, direct censorship/interference from power groups, etc.

"My point is that the nature of digital technologies is such that information is far more ephemeral and closed off than it's ever been"

I don't think I agree with that. For a lot of pre-historical research the only thing we have to go on is fossils and rock formations. Our picture of dinosaurs is extremely ephemeral and extrapolated from a very small number of things in the grand scheme of history. I don't think we can even begin to imagine the sheer number of events that happened in the total history of organic life forms that resulted in the current state of things. But knowing those things is really important for a lot of scientific fields.

Edit: Also I guess I just don't see why digital information is really significant here. It seems just as likely for a marginalized person without safety to have a physical notebook or photo album get lost or destroyed, for example.

> Welcome to the world of digital archiving. It's an enormously complicated space, and even for just my own personal projects and content, I've spent a lot of time thinking about how to ensure things are future proof and can be archived easily.

For web content, in my eyes it's a pretty cut-and-dried example - if the authors of any piece of content don't want it to be archived and aren't forthcoming in making this archival a viable pursuit, then the content simply should not be archived. Alternatively, just get a static PDF of it for future reference instead of fighting an uphill battle against webpages and even software that's user-hostile.

For your own content, however, I think that you're on the right track. Use simple file formats, have tested backups and ideally rely on stable, boring software that's also slow to evolve and change.

If we had only the things from ancient history that people wanted to be archived, we'd have a very different view of history. True, future historians will have a lot more material about our present, but that still shouldn't limit our intent to archive as much as we can. Only the future will know what is relevant.

> True, future historians will have a lot more material about our present, but that still shouldn't limit our intent to archive as much as we can.

In my eyes, unless you have a personal interest in the material you want archived that would make you overlook most complications, the burden of preserving information, or even of making it easy to preserve, should lie with its authors.

For example, if you as a person want to make your voice heard through the centuries, then it should be upon you to use open data formats or even something as simple as Markdown, as opposed to binary .docx files or similar formats that will be significantly harder to read.

Furthermore, there is no actual guarantee of anyone actually caring about what you (or, let's say, I) might say, for example, in an offhand remark on Twitter about seeing some cute kittens today, apart from any such data being included in a larger analysis of bulk data.

I'm certainly not against the idea of archiving or preserving data, but it seems that most larger events out there will have a huge amount of coverage either way.