by EthanHeilman 2389 days ago

I think if you gave a philologist living in 1880 AD a clay tablet with a binary inscription of a fragment of an English poem encoded in UTF-8, they would decode it very quickly.

This is what the philologist would see:

>...ABABABBBABBABAAAABBBBAABAABABBAAAABAAAAAABBABAABABBAABBAAABAAAAAAABAABBBABBBABAAABBABAABABBBAABBAABAAAAAABBAABAAABBAAAABABBABBBAABBAAABBABBABAABABBABBBAABBAABBBAABAAAAAABBBBAABABBABBBBABBBABABAABAAAAAABBBABBBABBABBBBABBBABABABBABBAAABBAABAAAABAAAAAABBAAABAABBAABABAABABBAAAAAABABAABABABAAABBABAAAABBAABABABBBAABAABBAABABAABAABBBABBBAABBAABAAAAAABBAAABAABBBAABAABBABAABABBBAABBABBABABBABBAABABABBBAABAAABAAAAAABBBAAAAABBABAABABBBAAAAABB...

How it would probably go:

1. Hmmmm, there are only two symbols, A and B. These symbols can't be words since no language has only two words. Thus the words must be made of strings of these symbols.

2. Every 8th symbol* is an A. Let's try putting the symbols in groups of size 8.

3. These groups of 8 can't be words because they repeat far too often, and with one symbol in each group always fixed they would allow only 2^7 = 128 possible words. Thus these groups of 8 might be letters in an alphabet.

4. Does the frequency of these possible letters fit any known language? Yes, English.

5. Which group of 8 is "e"?

A few minutes later and the clay tablet is decoded.

* - This is not always true in UTF-8, but it is true for text that uses only the basic (ASCII) Latin letters, including this example. Even with some variable-length characters thrown in, this fact would stand out.

1 comments

naniwaduni 2389 days ago

This is a very restricted subset of UTF-8. I agree that the ASCII subset would not be tremendously difficult to decipher; the most interesting parts are laid out systematically and in order, and case is even just a bit flip.
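The "bit flip" remark is easy to check. A minimal sketch, with `flip_case` as a made-up helper name: in ASCII, an upper-case letter and its lower-case counterpart differ only in the bit with value 0x20.

```python
def flip_case(letter):
    """Toggle the case of a single ASCII letter by flipping the 0x20 bit."""
    if len(letter) != 1 or not (letter.isascii() and letter.isalpha()):
        raise ValueError("expected a single ASCII letter")
    return chr(ord(letter) ^ 0x20)

# Print each letter next to its flipped form, with both bit patterns.
for lower in "abc":
    upper = flip_case(lower)
    print(lower, format(ord(lower), "08b"), upper, format(ord(upper), "08b"))
```

The printed bit patterns also show the "in order" part: consecutive letters get consecutive codes.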

It's even fairly plausible that the UTF-8 numerical encoding can be reverse-engineered from a few samples; text in enough languages generally uses characters from few enough blocks to identify. If you're really motivated, you can probably work your way through most of the languages with phonetic writing systems.
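For reference, this is the numerical structure a decipherer would have to recover. The sketch below decodes UTF-8 by hand from the leading-bit patterns; `decode_utf8_by_hand` is a made-up name, and the function deliberately skips the stricter validity checks (overlong forms, surrogates) that a real decoder performs.

```python
def decode_utf8_by_hand(data):
    """Turn UTF-8 bytes into a list of code points using only bit patterns."""
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead >> 7 == 0b0:          # 0xxxxxxx: single byte (ASCII)
            extra, value = 0, lead
        elif lead >> 5 == 0b110:      # 110xxxxx: one continuation byte follows
            extra, value = 1, lead & 0x1F
        elif lead >> 4 == 0b1110:     # 1110xxxx: two continuation bytes follow
            extra, value = 2, lead & 0x0F
        elif lead >> 3 == 0b11110:    # 11110xxx: three continuation bytes follow
            extra, value = 3, lead & 0x07
        else:
            raise ValueError(f"unexpected lead byte at index {i}")
        tail = data[i + 1:i + 1 + extra]
        if len(tail) != extra:
            raise ValueError("truncated sequence")
        for byte in tail:
            if byte >> 6 != 0b10:     # continuation bytes look like 10xxxxxx
                raise ValueError("bad continuation byte")
            value = (value << 6) | (byte & 0x3F)
        code_points.append(value)
        i += 1 + extra
    return code_points

# One character from each length class: 1, 2, 3, and 4 bytes.
text = "a\u00e9\u8a9e\U0001F600"
print([hex(cp) for cp in decode_utf8_by_hand(text.encode("utf-8"))])
```

The byte patterns are regular enough to spot in samples, but the resulting code points are just numbers. Mapping them back to characters is the part that needs outside knowledge.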

But then there's CJK Unified Ideographs, where the characters that get used are scattered essentially randomly because the ordering is only relevant if you already know how many and which characters were encoded at what point in the history of Unicode.

There are large swaths of Unicode which, if somehow totally lost, would essentially require finding font data or character reference tables to recover.

EthanHeilman 2389 days ago

I agree that recovering the CJK Unified Ideographs encodings would be far harder than recovering a phonetic alphabet; however, a few things could make it not as hard as it seems. The decoder has access to a text in both the future format and UTF-8. A text might mix phonetic words and ideographs, as Japanese sometimes does today. The phonetic words would provide clues as to the ideographic characters.

Code breakers have decoded ciphertexts that used a code in which each word was replaced with a number. To make it even harder, common words would be replaced by more than one number to defeat common frequency analysis techniques. This was often done with pen and paper.

Yuri Knorozov managed to decipher the Mayan script. That was a significantly harder task than recovering UTF-8 mappings because he had very little to work with on the source language (he did have some things).

tripzilch 2386 days ago

Exactly. You shouldn't underestimate the tremendous amount of work that has been put into deciphering actual ancient languages using advanced techniques and minor contextual clues. Compared to that, deciphering most common UTF-8 data would be relatively simple, meaning it could be done by a single person with some reverse engineering skills.
