| HN Mirror


by TheOtherHobbes 2391 days ago

This is not a solvable problem without technological continuity, or some technology far smarter than anything we can imagine today.

If you found a mysterious archive object and had no idea what it was - CD-R, hard drive, SSD, whatever - not only would you have to reinvent an entire hardware reader around it, you would also have to work out the file structure, extract the data (some of which could be damaged), and reverse engineer the container file formats and the data structures inside them.

If you got all of that right, you'd eventually be able to start trying to translate the content of the text, audio, images, videos (how many compression formats are there?) into something you could understand.

A much more advanced civilisation would struggle with making a cold start on all of that. In our current state, we'd get nowhere if we didn't already have some records explaining where to begin.

2 comments

EthanHeilman 2391 days ago

Take, for example, a CD-R of some MP3s with English-language file names stored on a FAT32 filesystem. Assume the reflective layer didn't rust, since it was abandoned in a dry climate, and that our future archaeologist has access to roughly modern levels of technology.

1. Even if the CD-R has been crushed and shattered, you could use a modern and cheap microscope to read continuous pits and lands off the disk [0,1]. It would be clear to anyone familiar with information theory how to translate the pits and lands into a series of arbitrary symbols which encode data.

2. This data would at first be meaningless. However, the mathematical relationships of a simple error correcting code would stand out. This would allow them to recover corrupted data. Once the error correcting code was stripped out, they would have a transcript of the raw data.

3. They would notice a pattern in the data. There would be long high entropy regions and then very short low entropy regions. They would probably notice that some of the low entropy regions had every 8th bit set to zero (ASCII) and that, if taken in 8-bit chunks, these regions had roughly the same number of symbols as the Latin alphabet. If they were familiar with English they might quickly decode these regions using letter frequency correspondence with another English text.

4. The high entropy regions would be far harder to decode. However these future archaeologists would be faced with the obvious data patterns of frames of an MP3. Decoding the first MP3 would be a serious project involving many institutions over many years but once it was done it would allow the decoding of all artifacts that use the MP3 and related encoding formats. Possibly someone would find a "rosetta file" [2], a disk that contained both a .wav file and an encoded MP3 of the same song. More likely someone would find an MP3 player and then reverse engineer the decoding algorithm.
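The entropy pattern from step 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "disc image" is synthetic (a repeated, made-up file name followed by pseudo-random bytes standing in for compressed audio), and the 256-byte window and 6.0 bits-per-byte threshold are arbitrary choices, not anything taken from a real CD-R or MP3 file.

```python
import math
import random
from collections import Counter


def shannon_entropy(chunk: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not chunk:
        return 0.0
    total = len(chunk)
    counts = Counter(chunk)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def high_bit_always_clear(chunk: bytes) -> bool:
    """True if every byte is below 0x80, the ASCII giveaway from step 3."""
    return all(b < 0x80 for b in chunk)


def classify_windows(data: bytes, window: int = 256, threshold: float = 6.0) -> list[str]:
    """Label each consecutive, non-overlapping window as 'low' or 'high' entropy."""
    labels = []
    for start in range(0, len(data), window):
        chunk = data[start:start + window]
        labels.append("low" if shannon_entropy(chunk) < threshold else "high")
    return labels


# Synthetic stand-in for the raw data (an assumption, not a real disc image):
# a text-like region followed by a compressed-looking region.
rng = random.Random(0)
names = (b"01 - Some Song Title.mp3 " * 50)[:1024]
audio = bytes(rng.randrange(256) for _ in range(4096))
image = names + audio

labels = classify_windows(image)
print(labels)
print("text region is 7-bit clean:", high_bit_always_clear(names))
```

The text-like windows come out as low entropy because they use only a handful of distinct byte values, while the pseudo-random windows sit close to the 8 bits-per-byte ceiling. Real MP3 data would not be perfectly random (frame headers repeat), which is the "obvious data pattern" step 4 refers to.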

[0]: "Being able to see the tracks and bits in a CD-ROM" https://superuser.com/questions/870776/being-able-to-see-the...

[1]: "CD-ROM Under the Microscope" https://www.youtube.com/watch?v=RZUxemOE07Q

[2]: https://en.wikipedia.org/wiki/Rosetta_Stone


tripzilch 2391 days ago

I mean, archaeology and linguistics have been figuring out ancient languages as an entire field, while determined individual hobbyists are able to reverse engineer unknown file formats.

By which I mean, many file formats are syntactically much simpler and more obviously structured than natural languages. It might take an entire field to reverse engineer weird formats like .DOC once all knowledge gets lost, but I doubt this will be the case for bitmaps or UTF-8 ...


naniwaduni 2391 days ago

Bitmaps are easy enough, but I wouldn't bet on UTF-8.

And any modern compression is probably right out without technological continuity.


EthanHeilman 2390 days ago

I think if you gave a philologist living in 1880 AD a clay tablet with a binary inscription of a fragment of an English poem encoded in UTF-8, they would decode it very quickly.

This is what the philologist would see:

>...ABABABBBABBABAAAABBBBAABAABABBAAAABAAAAAABBABAABABBAABBAAABAAAAAAABAABBBABBBABAAABBABAABABBBAABBAABAAAAAABBAABAAABBAAAABABBABBBAABBAAABBABBABAABABBABBBAABBAABBBAABAAAAAABBBBAABABBABBBBABBBABABAABAAAAAABBBABBBABBABBBBABBBABABABBABBAAABBAABAAAABAAAAAABBAAABAABBAABABAABABBAAAAAABABAABABABAAABBABAAAABBAABABABBBAABAABBAABABAABAABBBABBBAABBAABAAAAAABBAAABAABBBAABAABBABAABABBBAABBABBABABBABBAABABABBBAABAAABAAAAAABBBAAAAABBABAABABBBAAAAABB...

How it would probably go:

1. Hmmmm there are only two symbols A and B, these symbols can't be words since no language has only two words. Thus the words must be made of a string of these symbols.

2. Every 8th symbol* is an A. Let's try putting the symbols in groups of size 8.

3. These groups of 8 can't be words because they repeat far too often and they would only allow 128 possible words. Thus these groups of 8 might be letters in an alphabet.

4. Do the frequencies of these possible letters fit any known language? Yes, English.

5. Which group of 8 is "e"?

A few minutes later and the clay tablet is decoded.

* - This is not always true in UTF-8 but is true in most encodings of Latin alphabets, including this example. Even with some variable length characters thrown in, this fact would stand out.
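Steps 2 and 3 can be sketched in Python on the first 24 symbols of the inscription quoted above. The assumptions are loud ones: the fragment is taken to start on a group boundary (as the quoted sample happens to), A is read as 0 and B as 1 with the most significant bit first, and the final lookup cheats by using Python's built-in ASCII table, where the philologist would instead have to match group frequencies against English letter frequencies (steps 4 and 5). The function names are made up for this illustration.

```python
from collections import Counter


def every_nth_is(symbols: str, n: int, expected: str) -> bool:
    """True if symbols[0], symbols[n], symbols[2n], ... all equal `expected`."""
    return all(s == expected for s in symbols[::n])


def groups_of(symbols: str, size: int) -> list[str]:
    """Split into consecutive groups of `size`, dropping any trailing partial group."""
    usable = len(symbols) - len(symbols) % size
    return [symbols[i:i + size] for i in range(0, usable, size)]


def decode(symbols: str, zero: str = "A") -> str:
    """Read each group of 8 as a big-endian binary number and map it via ASCII."""
    chars = []
    for group in groups_of(symbols, 8):
        bits = "".join("0" if s == zero else "1" for s in group)
        chars.append(chr(int(bits, 2)))
    return "".join(chars)


# First 24 symbols of the tablet text quoted in the comment above.
sample = "ABABABBBABBABAAAABBBBAAB"

print(every_nth_is(sample, 8, "A"))   # step 2: the regularity that suggests groups of 8
print(Counter(groups_of(sample, 8)))  # step 3/4: raw material for frequency analysis
print(decode(sample))                 # step 5, shortcut via the known ASCII table
```

On a fragment this short the frequency count is useless, of course; it only becomes informative over the full inscription.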


naniwaduni 2390 days ago

This is a very restricted subset of UTF-8. I agree that the ASCII subset would not be tremendously difficult to decipher; the most interesting parts are laid out systematically and in order, and case is even just a bit flip.
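A small Python illustration of how systematic the ASCII layout is. The helper name `flip_case` and the constant `CASE_BIT` are made up for this sketch; the underlying facts (letters and digits are contiguous and in order, and upper and lower case differ only in bit 0x20) are properties of ASCII itself.

```python
CASE_BIT = 0x20  # the single bit that separates 'A' (0x41) from 'a' (0x61)


def flip_case(ch: str) -> str:
    """Toggle the case of an ASCII letter by flipping one bit; leave others alone."""
    if ch.isascii() and ch.isalpha():
        return chr(ord(ch) ^ CASE_BIT)
    return ch


# Letters are laid out contiguously and in alphabetical order.
lowercase = "".join(chr(c) for c in range(ord("a"), ord("z") + 1))
print(lowercase)

# Case is one bit flip.
print(flip_case("a"), flip_case("Q"), flip_case("!"))

# Digits are in order too, and the low four bits of a digit are its value.
print(ord("7") & 0x0F)
```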

It's even fairly plausible that the UTF-8 numerical encoding can be reverse-engineered from a few samples; the text of most languages generally uses characters from only a few blocks, few enough to identify. If you're really motivated, you can probably work your way through most of the languages with phonetic writing systems.
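For reference, the numerical encoding in question is regular enough to fit in a short function. The sketch below decodes one UTF-8 sequence by hand: the leading byte's high bits give the sequence length, and each continuation byte has the form 10xxxxxx and contributes six bits. It is a minimal illustration, not a validating decoder (it does not reject overlong forms or surrogates), and the function name is made up.

```python
def decode_utf8_sequence(seq: bytes) -> int:
    """Return the code point encoded by a single UTF-8 byte sequence."""
    first = seq[0]
    if first < 0x80:                # 0xxxxxxx: plain ASCII, one byte
        continuation, code_point = 0, first
    elif first >> 5 == 0b110:       # 110xxxxx: two-byte sequence
        continuation, code_point = 1, first & 0x1F
    elif first >> 4 == 0b1110:      # 1110xxxx: three-byte sequence
        continuation, code_point = 2, first & 0x0F
    elif first >> 3 == 0b11110:     # 11110xxx: four-byte sequence
        continuation, code_point = 3, first & 0x07
    else:
        raise ValueError("not a valid leading byte")
    if len(seq) != continuation + 1:
        raise ValueError("wrong number of bytes for this leading byte")
    for byte in seq[1:]:
        if byte >> 6 != 0b10:       # continuation bytes must look like 10xxxxxx
            raise ValueError("not a continuation byte")
        code_point = (code_point << 6) | (byte & 0x3F)
    return code_point


for ch in ["e", "é", "я", "日", "😀"]:
    encoded = ch.encode("utf-8")
    print(ch, encoded.hex(), hex(decode_utf8_sequence(encoded)))
```

Recovering this bit layout only yields code point numbers. Which glyph each number stands for is the separate problem discussed in the rest of this comment.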

But then there's CJK Unified Ideographs, where the characters that get used are scattered essentially randomly because the ordering is only relevant if you already know how many and which characters were encoded at what point in the history of Unicode.

There are large swaths of Unicode which, if somehow totally lost, would essentially require finding font data or character reference tables to recover.


EthanHeilman 2390 days ago

I agree recovering CJK Unified Ideographs encodings would be far harder than a phonetic alphabet; however, a few things could make it not as hard as it seems. The decoder might have access to a text in both the future format and UTF-8. A text might mix phonetic words and ideographs, as Japanese sometimes does today. The phonetic words would provide clues as to the ideographic characters.

Code breakers have decoded ciphertexts which used a code such that each word was replaced with a number. To make it even harder, common words would be replaced by more than one number to defeat common frequency analysis techniques. This was often done with pen and paper.

Yuri Knorozov managed to decipher the Mayan script. That was a significantly harder task than recovering UTF-8 mappings because he had very little to work with on the source language (he did have some things).


steve19 2391 days ago

An engraved metal or stone tablet could be left along with the CDs to bootstrap the process. It could range from explaining the MP3 spec, to as simple as pictograms showing human speech being converted to microscopic pits. Explaining ASCII would be even easier.


nyolfen 2391 days ago

the storage part at least could be a solved problem: https://en.wikipedia.org/wiki/5D_optical_data_storage
