Hacker News new | ask | show | jobs
by EthanHeilman 2390 days ago
Take a CD-R of some MP3 with English language file names stored on a FAT32 filesystem for example. Assume the reflective layer didn't rust since it was abandoned in a dry climate and our future archaeologist has access to roughly modern levels of technology.

1. Even if the CD-R has been crushed and shattered you could use a modern and cheap microscope to read continuous pits and lands off the disk [0,1]. It would be clear to anyone familiar with information theory how to translate the pits and lands to a series of set of arbitrary symbols which encode data.

2. This data would at first be meaningless. However the mathematical relationships of a simple error correcting code would stand out. This would allow them recover corrupted data. Once the error correcting code was stripped out they have a transcript of the raw data.

3. They would notice a pattern in the data. There would be long high entropy regions and then very short low entropy regions. They would probably notice that some of the low entropy regions had every 8-th bit set to zero (ASCII) and if taken in 8-bit chunks these regions had the roughly the same number of symbols as in the latin alphabet. If they were familiar with English they might quickly decode these regions using letter frequency correspondence with another English text.

4. The high entropy regions would be far harder to decode. However these future archaeologists would be faced with the obvious data patterns of frames of an MP3. Decoding the first MP3 would be a serious project involving many institutions over many years but once it was done it would allow the decoding of all artifacts that use the MP3 and related encoding formats. Possibly someone would find a "rosetta file" [2], a disk that contained both a .wav file and an encoded MP3 of the same song. More likely someone would find an MP3 player and then reverse engineer the decoding algorithm.

[0]: "Being able to see the tracks and bits in a CD-ROM" https://superuser.com/questions/870776/being-able-to-see-the...

[1]: "CD-ROM Under the Microscope" https://www.youtube.com/watch?v=RZUxemOE07Q

[2]: https://en.wikipedia.org/wiki/Rosetta_Stone

2 comments

I mean, archaeology and linguistics have been figuring out ancient languages as an entire field, while determined individual hobbyists are able to reverse engineer unknown file formats.

By which I mean, many file formats are syntactically much simpler and more obviously structured than natural languages. It might take an entire field to reverse engineer weird formats like .DOC once all knowledge gets lost, but I doubt this will be the case for bitmaps or UTF-8 ...

Bitmaps are easy enough, but I wouldn't bet on UTF-8.

And any modern compression is probably right out without technological continuity.

I think if you gave a philologist living in 1880 AD a clay tablet with a binary inscription of a fragment of an English poem encoded UTF-8 they would decode it very quickly.

This is what the philologist would see:

>...ABABABBBABBABAAAABBBBAABAABABBAAAABAAAAAABBABAABABBAABBAAABAAAAAAABAABBBABBBABAAABBABAABABBBAABBAABAAAAAABBAABAAABBAAAABABBABBBAABBAAABBABBABAABABBABBBAABBAABBBAABAAAAAABBBBAABABBABBBBABBBABABAABAAAAAABBBABBBABBABBBBABBBABABABBABBAAABBAABAAAABAAAAAABBAAABAABBAABABAABABBAAAAAABABAABABABAAABBABAAAABBAABABABBBAABAABBAABABAABAABBBABBBAABBAABAAAAAABBAAABAABBBAABAABBABAABABBBAABBABBABABBABBAABABABBBAABAAABAAAAAABBBAAAAABBABAABABBBAAAAABB...

How it would probably go:

1. Hmmmm there are only two symbols A and B, these symbols can't be words since no language has only two words. Thus the words must be made of a string of these symbols.

2. Every 8-th symbol* is a A. Lets try putting the symbols in groups of size 8.

3. These groups of 8 can't be words because they repeat far too often and they would only allow 128 possible words. Thus these groups of 8 might be letters in an alphabet.

4. Does the frequency of this possible letters fit any known languages? Yes, English.

5. Which group of 8 is "e"?

A few minutes later and the clay tablet is decoded.

* - This is not always true in utf-8 but true in most encoding of Latin alphabets including this example. Even with some variable length characters thrown in this fact would stand out.

This is a very restricted subset of utf-8. I agree that the ASCII subset would not be tremendously difficult to decipher; the most interesting parts are laid out systematically and in order and case is even just a bit flip.

It's even fairly plausible that the utf-8 numerical encoding can be reverse-engineered from a few samples; enough languages' text generally only use characters from few enough blocks to identify. If you're really motivated, you can probably work your way through most of the languages with phonetic writing systems.

But then there's CJK Unified Ideographs, where the characters that get used are scattered essentially randomly because the ordering is only relevant if you already know how many and which characters were encoded at what point in the history of Unicode.

There are large swaths of Unicode which, if somehow totally lost, would essentially require finding font data or character reference tables to recover.

I agree recovering CJK Unified Ideographs encodings would be far harder than a phonetic alphabet, however a few things could make not as hard as it seems. The decoder has access to a text in both the future format and UTF-8. A text might mix phonetic words and ideographs as Japanese sometimes does today. The phonetic words would provide clues as to the ideographic characters.

Code breakers have decoded ciphertexts which used a code such that each word was replaced with a number. To make it even harder common words would be replaced by more than one numbers to defeat common frequency analysis techniques. This was done often with pen and paper.

Yuri Knorozov managed to decipher the Mayan script. That was a significantly harder task than recovering UTF-8 mappings because he has very little to work with on the source language (he did have somethings).

Exactly. You shouldn't underestimate the tremendous amount of work has been put into deciphering actual ancient languages using advanced techniques and minor contextual clues. Compared to that, deciphering most common UTF8 data would be relatively simple, meaning it could be done by a single person with some reverse engineering skills.
An engraved metal or stone tablet could be left along with the CDs to bootstrap the process. It could range from explaining the MP3 spec, to as simple as pictograms showing human speech being converted to microscopic pits. Explaining ASCII would be even easier.