by EthanHeilman 2390 days ago
Maybe we should print this out on acid-free paper-thin flexible wood-pulp sheets stitched together to form linear organized aggregations. Each aggregation would contain one or more works and be searchable using a SQL-like database. To make this plan really work there would need to be a collection of geographically distributed long term physical repositories that would receive periodic updates as new material became available.

All joking aside, I do wonder whether digital or analogue formats are better able to survive into the distant future.

* What impact will DRM have on the accessibility of our knowledge to future historians?

* Is anything recoverable from a harddrive or flash media after 500 years in a landfill?

* Will compressed files be more or less recoverable? What about git archives?

* Will the future know the shape of our plastic GI Joe toys but not the content of the GI Joe cartoon?


> I do wonder whether digital or analogue formats are better able to survive into the distant future.

There are 5000 year old clay tablets we can still read.

There are centuries old documents on paper, vellum etc. that we can still read.

I personally have decades-old paper documents I can easily read, and a box of floppies I can't.

It's not just a problem of unreadable physical media, I have a database file on a perfectly readable HD that was generated by an application that is no longer available. I might be able to interrogate it somehow, but it won't be easy.

Digital formats and connectivity make LOCKSS easier, so that's a plus. There's less chance of a fire or flood or space-limited librarian destroying the last known copy. However, without archivists actively transforming content to new formats as required, it might only take a few decades before a lot of content starts to require a massive effort to read.

Clay is the plastic of the ancient world.

Let's say the probability that a single copy of a physical book survives 1,000 years, is found and is understood by an archaeologist, is pB, and the probability that a single copy of a book on an SSD survives 1,000 years, is found and is understood by an archaeologist*, is pD. Even if pB is far larger than pD, there might be so many more copies of a single book held on SSDs, thus making it more likely the book will survive via an SSD than a physical book. On the other hand, the technology to recover data from SSDs might not exist in 1,000 years.
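
To put rough numbers on that argument, here is a minimal sketch. It assumes every copy survives independently with the same probability, which is a strong simplification, and the values chosen for pB, pD and the copy counts are invented for illustration, not measurements.

```python
def p_any_survives(p_single: float, copies: int) -> float:
    """Probability that at least one of `copies` independent copies survives."""
    return 1.0 - (1.0 - p_single) ** copies

# Hypothetical numbers only: pB is 1,000 times larger than pD,
# but there are 10,000 times more digital copies than printed ones.
p_b, n_books = 1e-3, 1_000
p_d, n_ssds = 1e-6, 10_000_000

book_odds = p_any_survives(p_b, n_books)
ssd_odds = p_any_survives(p_d, n_ssds)
print(f"at least one book survives: {book_odds:.3f}")
print(f"at least one SSD copy survives: {ssd_odds:.3f}")
```

Under these made-up inputs the sheer number of copies outweighs the lower per-copy odds. If copies fail together (same format, same obsolete reader), the independence assumption breaks and the advantage shrinks.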

It could also be the case that each generation would copy these books onto new digital media, providing an unbroken chain of copies. The oldest complete copy of the Iliad is Venetus A, which is from around 1000 AD (about 1,000 years ago), despite the Iliad probably first being written down around 800 BC (about 2,800 years ago). It was copied from earlier copies of copies of copies.

I really don't know how this will play out, and I've been unable to find research on how long SSDs and other flash-memory-based media survive, especially if buried in a landfill.

* - If archaeologists exist in the future. The current push from the STEM boosters to defund and de-emphasize the humanities may result in a near-future without archaeologists or funded archaeological projects. Over 1,000 years the entire field could die.

> thus making it more likely the book will survive via an SSD than a physical book

Yes. That's what I mean by LOCKSS being easier.

> is found and is understood by an archaeologist,

There is a problem with merging these two probabilities.

The probability of finding a book is of course massively smaller than the probability of finding a digital copy.

The probability of understanding a book is so much greater than the probability of understanding a file on a disk.

This makes it more likely that the physical book will survive in a meaningful way.

> It could also be the case that each generation would copy these books onto new digital media

This is what I mean by archivists actively transforming the content. Regarding written content like the Iliad, copies and translations can be made centuries apart. Content in digital formats may need to be transformed whenever the application that reads it is discontinued.

Would an SSD even function after 1000 years? Unless sealed, I imagine ambient moisture would do a number inside the drive. The same is true for books of course, but we still have 1000 year old books that have lasted by sitting on a shelf in churches and temples, etc., without any specific care until recent history.

The nice part of a book in an apocalyptic scenario is that you can copy it even if you don't know the language. You don't need a special tool for this, only one capable of marking a surface. It wouldn't be fun or fast, but it's possible and it's what monks did for centuries. Would archeologists 1000 years from now be lucky enough to find a SATA cable too?

It doesn't really matter if the SSD as a whole still works, because after 1000 years you'll never recover the data via the normal interface. Modern MLC flash is often specified for less than 1 year data retention, and even SLC is unlikely to make it to 1000 years. Attempting to read it will only make things worse ("read disturb"). The best hope of saving the data is with some future nanotech that directly probes each floating gate transistor and counts the electrons, and reverse engineering all the error correction and wear leveling.

I would assume they would read the SSD not by powering it on and plugging it into a computer but by disassembling it and imaging the physical structure. This would also bypass all the wear leveling infrastructure, allowing them to recover deleted data. It reminds me of the current techniques of using x-rays to read writing on the odd scraps of paper used to bind a book [0].

[0]: "X-rays reveal 1,300-year-old writings inside later bookbindings" https://www.theguardian.com/books/2016/jun/04/x-rays-reveal-...

No one is proposing we use floppy disks.

Redundant, shared servers ARE a forever solution. Making sure your data is on one of the ones that makes it seems like a vastly easier proposition to me than writing data to clay tablets and trying to keep those from ending up in a dump somewhere.

What is the likelihood that historians a century or two hence will have an application capable of turning an ISO 32000-1 file into a human-readable text?

If we are talking about archaeologists, rather than historians, even ASCII and Unicode could be a challenge to work out.

Because those hundreds of years don't pass in the blink of an eye. At some point in the middle there will be deprecated formats and new ones, and transcoders you can batch run. Sure, it relies on intervention, but the upside is that anyone and everyone else can copy that one person's work.

Yes we should learn from history, but we should also not assume that everything that happened before will happen the same way again, given how much of our world has changed.

> However, without archivists actively transforming content to new formats as required, it might only take a few decades before a lot of content starts to require a massive effort to read.

More effort than batch reading physical books and tablets in old languages?

You can reuse interfaces more easily on data, and current ML could probably pull some of the weight of interpreting old data right now, not to mention what we will have 50 years from now.

0.99999 at least.

Compare the capabilities of digital historians today to those of 10 and 20 years ago. It's night and day.

This is not a solvable problem without technological continuity, or some smart technology we can't imagine today.

If you found a mysterious archive object and had no idea what it was - CD-R, hard drive, SSD, whatever - not only would you have to reinvent an entire hardware reader around it, you would also have to work out the file structure, extract the data (some of which could be damaged), and reverse engineer the container file formats and the data structures inside them.

If you got all of that right, you'd eventually be able to start trying to translate the content of the text, audio, images, videos (how many compression formats are there?) into something you could understand.

A much more advanced civilisation would struggle with making a cold start on all of that. In our current state, we'd get nowhere if we didn't already have some records explaining where to begin.

Take a CD-R of some MP3s with English language file names stored on a FAT32 filesystem, for example. Assume the reflective layer didn't rust since it was abandoned in a dry climate and our future archaeologist has access to roughly modern levels of technology.

1. Even if the CD-R has been crushed and shattered you could use a modern and cheap microscope to read continuous pits and lands off the disk [0,1]. It would be clear to anyone familiar with information theory how to translate the pits and lands to a series of arbitrary symbols which encode data.

2. This data would at first be meaningless. However, the mathematical relationships of a simple error correcting code would stand out. This would allow them to recover corrupted data. Once the error correcting code was stripped out they would have a transcript of the raw data.

3. They would notice a pattern in the data. There would be long high entropy regions and then very short low entropy regions. They would probably notice that some of the low entropy regions had every 8th bit set to zero (ASCII) and, if taken in 8-bit chunks, these regions had roughly the same number of symbols as the Latin alphabet. If they were familiar with English they might quickly decode these regions using letter frequency correspondence with another English text.

4. The high entropy regions would be far harder to decode. However these future archaeologists would be faced with the obvious data patterns of frames of an MP3. Decoding the first MP3 would be a serious project involving many institutions over many years but once it was done it would allow the decoding of all artifacts that use the MP3 and related encoding formats. Possibly someone would find a "rosetta file" [2], a disk that contained both a .wav file and an encoded MP3 of the same song. More likely someone would find an MP3 player and then reverse engineer the decoding algorithm.
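
The entropy observation in step 3 can be sketched in a few lines. This is only an illustration under stated assumptions: pseudo-random bytes stand in for compressed audio, a repeated English phrase stands in for file names, and the 256-byte window size is arbitrary. Nothing here reads a real disc or FAT32 image.

```python
import math
import random
from collections import Counter

def shannon_entropy(chunk: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not chunk:
        return 0.0
    total = len(chunk)
    counts = Counter(chunk)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def high_bit_always_zero(chunk: bytes) -> bool:
    """True if every byte is below 128, as in plain ASCII text."""
    return all(b < 0x80 for b in chunk)

def classify_windows(data: bytes, window: int = 256):
    """Yield (offset, entropy, ascii_like) for each fixed-size window."""
    for offset in range(0, len(data), window):
        chunk = data[offset:offset + window]
        yield offset, shannon_entropy(chunk), high_bit_always_zero(chunk)

# Stand-in data (hypothetical): noise, then text, then noise again.
rng = random.Random(0)
noise_a = bytes(rng.randrange(256) for _ in range(256))
noise_b = bytes(rng.randrange(256) for _ in range(256))
text = (b"TRACK01 MP3 the quick brown fox jumps over the lazy dog " * 5)[:256]
blob = noise_a + text + noise_b

report = list(classify_windows(blob))
for offset, entropy, ascii_like in report:
    print(f"{offset:5d}  {entropy:4.2f} bits/byte  ascii_like={ascii_like}")
```

The text window shows up with much lower entropy and an always-zero top bit, which is the kind of contrast the archaeologist would be looking for.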

[0]: "Being able to see the tracks and bits in a CD-ROM" https://superuser.com/questions/870776/being-able-to-see-the...

[1]: "CD-ROM Under the Microscope" https://www.youtube.com/watch?v=RZUxemOE07Q

[2]: https://en.wikipedia.org/wiki/Rosetta_Stone

I mean, archaeology and linguistics have been figuring out ancient languages as an entire field, while determined individual hobbyists are able to reverse engineer unknown file formats.

By which I mean, many file formats are syntactically much simpler and more obviously structured than natural languages. It might take an entire field to reverse engineer weird formats like .DOC once all knowledge gets lost, but I doubt this will be the case for bitmaps or UTF-8 ...

Bitmaps are easy enough, but I wouldn't bet on UTF-8.

And any modern compression is probably right out without technological continuity.

I think if you gave a philologist living in 1880 AD a clay tablet with a binary inscription of a fragment of an English poem encoded in UTF-8, they would decode it very quickly.

This is what the philologist would see:

>...ABABABBBABBABAAAABBBBAABAABABBAAAABAAAAAABBABAABABBAABBAAABAAAAAAABAABBBABBBABAAABBABAABABBBAABBAABAAAAAABBAABAAABBAAAABABBABBBAABBAAABBABBABAABABBABBBAABBAABBBAABAAAAAABBBBAABABBABBBBABBBABABAABAAAAAABBBABBBABBABBBBABBBABABABBABBAAABBAABAAAABAAAAAABBAAABAABBAABABAABABBAAAAAABABAABABABAAABBABAAAABBAABABABBBAABAABBAABABAABAABBBABBBAABBAABAAAAAABBAAABAABBBAABAABBABAABABBBAABBABBABABBABBAABABABBBAABAAABAAAAAABBBAAAAABBABAABABBBAAAAABB...

How it would probably go:

1. Hmmmm there are only two symbols A and B, these symbols can't be words since no language has only two words. Thus the words must be made of a string of these symbols.

2. Every 8th symbol* is an A. Let's try putting the symbols in groups of size 8.

3. These groups of 8 can't be words because they repeat far too often and they would only allow 128 possible words. Thus these groups of 8 might be letters in an alphabet.

4. Does the frequency of these possible letters fit any known language? Yes, English.

5. Which group of 8 is "e"?

A few minutes later and the clay tablet is decoded.

* - This is not always true in UTF-8 but is true in most encodings of Latin alphabets, including this example. Even with some variable length characters thrown in, this fact would stand out.
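
The steps above can be sketched roughly as follows. The helper names are made up for illustration, the mapping A = 0 and B = 1 is what the tablet example implies, and the sample sentence is my own rather than the tablet text. The last line decodes only the first 32 symbols of the inscription quoted above.

```python
from collections import Counter

def to_symbols(text: str) -> str:
    """Write text as A/B symbols: A for a 0 bit, B for a 1 bit (UTF-8 bytes)."""
    bits = "".join(f"{byte:08b}" for byte in text.encode("utf-8"))
    return bits.replace("0", "A").replace("1", "B")

def constant_positions(symbols: str, period: int) -> list:
    """Offsets within each period-sized group that always hold the same symbol."""
    return [
        offset for offset in range(period)
        if len(set(symbols[offset::period])) == 1
    ]

def group_frequencies(symbols: str, size: int = 8) -> Counter:
    """Count how often each complete group of `size` symbols occurs."""
    groups = [symbols[i:i + size] for i in range(0, len(symbols) - size + 1, size)]
    return Counter(groups)

def decode(symbols: str) -> str:
    """Final step once the mapping is known: groups of 8 back to characters."""
    bits = symbols.replace("A", "0").replace("B", "1")
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits) - 7, 8))
    return data.decode("utf-8", errors="replace")

sample = to_symbols("the rain in spain stays mainly in the plain")
print(constant_positions(sample, 8))              # slots that never change
print(group_frequencies(sample).most_common(3))   # candidate frequent letters
print(decode("ABABABBBABBABAAAABBBBAABAABABBAA"))  # start of the tablet
```

One wrinkle for step 5: in a short sample like this the most frequent group is the space rather than "e", so the philologist would probably identify word boundaries first.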

This is a very restricted subset of utf-8. I agree that the ASCII subset would not be tremendously difficult to decipher; the most interesting parts are laid out systematically and in order and case is even just a bit flip.
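
A quick sketch of what "laid out systematically" means for the ASCII subset: the letters have consecutive code points in alphabetical order, and upper and lower case differ in exactly one bit (0x20). The `flip_case` helper is just an illustrative name.

```python
import string

# Upper- and lower-case ASCII letters differ only in bit 0x20.
for upper, lower in zip(string.ascii_uppercase, string.ascii_lowercase):
    assert ord(upper) ^ ord(lower) == 0x20

# Letters are laid out in alphabetical order with consecutive code points.
codes = [ord(c) for c in string.ascii_lowercase]
assert codes == list(range(ord("a"), ord("a") + 26))

def flip_case(ch: str) -> str:
    """Toggle the case of an ASCII letter by flipping a single bit."""
    return chr(ord(ch) ^ 0x20) if ch.isascii() and ch.isalpha() else ch

print(f"{ord('A'):08b} A")
print(f"{ord('a'):08b} a")
print("".join(flip_case(c) for c in "Clay Tablet"))
```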

It's even fairly plausible that the utf-8 numerical encoding can be reverse-engineered from a few samples; enough languages' text generally only use characters from few enough blocks to identify. If you're really motivated, you can probably work your way through most of the languages with phonetic writing systems.
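
For reference, this is the numerical structure someone would be trying to reverse-engineer. It is a minimal sketch of UTF-8's lead-byte and continuation-byte layout with hand-rolled helpers (the function names are mine); it assumes well-formed input and does no validation.

```python
def describe_utf8(ch: str) -> str:
    """Show the bytes of one character in binary, marking each byte's role."""
    parts = []
    for byte in ch.encode("utf-8"):
        if byte < 0x80:
            role = "ascii"          # 0xxxxxxx
        elif byte & 0xC0 == 0x80:
            role = "continuation"   # 10xxxxxx
        else:
            role = "lead"           # 110xxxxx, 1110xxxx or 11110xxx
        parts.append(f"{byte:08b}({role})")
    return " ".join(parts)

def decode_code_point(raw: bytes) -> int:
    """Recover the code point from one well-formed UTF-8 sequence by hand."""
    first = raw[0]
    if first < 0x80:
        return first
    # The number of leading 1 bits in the lead byte gives the sequence length.
    length = 2 if first & 0xE0 == 0xC0 else 3 if first & 0xF0 == 0xE0 else 4
    value = first & (0xFF >> (length + 1))     # payload bits of the lead byte
    for byte in raw[1:length]:
        value = (value << 6) | (byte & 0x3F)   # six payload bits per continuation
    return value

for ch in ["e", "é", "я", "字"]:
    print(ch, describe_utf8(ch), hex(decode_code_point(ch.encode("utf-8"))))
```

The byte-level pattern is regular enough to spot from samples, but it only gets you to code point numbers. Which character each number means is a separate problem.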

But then there's CJK Unified Ideographs, where the characters that get used are scattered essentially randomly because the ordering is only relevant if you already know how many and which characters were encoded at what point in the history of Unicode.

There are large swaths of Unicode which, if somehow totally lost, would essentially require finding font data or character reference tables to recover.

An engraved metal or stone tablet could be left along with the CDs to bootstrap the process. It could range from explaining the MP3 spec, to as simple as pictograms showing human speech being converted to microscopic pits. Explaining ASCII would be even easier.

The storage part at least could be a solved problem: https://en.wikipedia.org/wiki/5D_optical_data_storage

Pretty much everyone in a tech job could afford to buy 40TB of storage, at home or remotely, and mirror the entire repo. Given this low barrier to entry, I think that if you can afford to help preserve the information then you can and probably should. Even if only a few people do it, that's more points of recovery.

Storage isn't hard, but downloading 40 TB can be a problem. Are there any arrangements for physical distribution (of the "truck loaded with USB drives" variety)?

I'd say... back in the day... anyone could afford to buy a single floppy disk and store files on it. But how many actually did, and how many are actually recoverable? Lots probably got thrown out in the intervening years.

In the GLAM sector the LOCKSS[1] project is quite well known. It tries to deal with some of the resiliency problems that are inherent in digital preservation. However, I'd guess this system does not offer the needed anonymity.

[1] https://www.lockss.org/ ; https://en.wikipedia.org/wiki/LOCKSS

Forget DRM, even future English may be incomprehensible. There is an entire field of study dedicated to finding a way to make our future voice heard, without a good plausible solution, called Nuclear Semiotics (https://en.wikipedia.org/wiki/Nuclear_semiotics).

If we can't effectively warn a future (>10,000 years) generation to stay away from something that may harm or kill them, what chance do we have of making a universally understandable archive of data?