| I think if you gave a philologist living in 1880 AD a clay tablet with a binary inscription of a fragment of an English poem encoded in UTF-8, they would decode it very quickly. This is what the philologist would see: >...ABABABBBABBABAAAABBBBAABAABABBAAAABAAAAAABBABAABABBAABBAAABAAAAAAABAABBBABBBABAAABBABAABABBBAABBAABAAAAAABBAABAAABBAAAABABBABBBAABBAAABBABBABAABABBABBBAABBAABBBAABAAAAAABBBBAABABBABBBBABBBABABAABAAAAAABBBABBBABBABBBBABBBABABABBABBAAABBAABAAAABAAAAAABBAAABAABBAABABAABABBAAAAAABABAABABABAAABBABAAAABBAABABABBBAABAABBAABABAABAABBBABBBAABBAABAAAAAABBAAABAABBBAABAABBABAABABBBAABBABBABABBABBAABABABBBAABAAABAAAAAABBBAAAAABBABAABABBBAAAAABB... How it would probably go: 1. Hmmmm, there are only two symbols, A and B. These symbols can't be words, since no language has only two words. Thus the words must be made of strings of these symbols. 2. Every 8th symbol* is an A. Let's try putting the symbols in groups of 8. 3. These groups of 8 can't be words, because they repeat far too often and they would only allow 128 possible words. Thus these groups of 8 might be letters in an alphabet. 4. Does the frequency of these possible letters fit any known language? Yes, English. 5. Which group of 8 is "e"? A few minutes later, the clay tablet is decoded. * - This is not always true in UTF-8, but it is true for most encodings of Latin alphabets, including this example. Even with some variable-length characters thrown in, this fact would stand out. |
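For what it's worth, steps 2 to 4 of the quoted procedure can be sketched in a few lines of Python. This is only an illustration under my own assumptions: A is read as 0 and B as 1, the sample sentence is a stand-in I made up rather than the tablet text, and the helper names (`to_symbols`, `constant_positions`, `group_frequencies`) are hypothetical.

```python
from collections import Counter


def to_symbols(text):
    """Encode text as UTF-8 and write each bit as A (0) or B (1)."""
    bits = "".join(f"{byte:08b}" for byte in text.encode("utf-8"))
    return bits.replace("0", "A").replace("1", "B")


def split_groups(symbols, period):
    """Cut the symbol stream into complete groups of `period` symbols."""
    usable = len(symbols) - len(symbols) % period
    return [symbols[i:i + period] for i in range(0, usable, period)]


def constant_positions(symbols, period):
    """Step 2: offsets inside each group where the symbol never changes."""
    groups = split_groups(symbols, period)
    return [k for k in range(period) if len({g[k] for g in groups}) == 1]


def group_frequencies(symbols, period=8):
    """Steps 3 and 4: treat each group as a candidate letter and count it."""
    return Counter(split_groups(symbols, period))


# Stand-in sample (an assumption, not the inscription from the tablet).
symbols = to_symbols("the rain in spain stays mainly in the plain")

print(constant_positions(symbols, 8))            # includes offset 0
print(group_frequencies(symbols).most_common(3))
```

In this sample the most frequent group is the space character, which is the usual first foothold in a frequency analysis; matching the remaining groups against English letter frequencies is left out of the sketch.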
It's even fairly plausible that the UTF-8 numerical encoding could be reverse-engineered from a few samples: text in most languages draws its characters from only a few blocks, few enough that each block can be identified. If you're really motivated, you can probably work your way through most of the languages with phonetic writing systems.
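As a rough illustration of the clustering, here is a minimal Python sketch. It assumes well-formed UTF-8 input and uses 256-code-point rows as a crude stand-in for real Unicode blocks; `decode_utf8` and `rows_used` are names I made up, and the sample strings are my own.

```python
from collections import Counter


def decode_utf8(data):
    """Hand-rolled decoder for well-formed UTF-8 (no validation at all)."""
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:             # 0xxxxxxx: single byte
            length, value = 1, lead
        elif lead >> 5 == 0b110:    # 110xxxxx: two-byte sequence
            length, value = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:   # 1110xxxx: three-byte sequence
            length, value = 3, lead & 0x0F
        else:                       # 11110xxx: four-byte sequence
            length, value = 4, lead & 0x07
        # Each continuation byte (10xxxxxx) contributes six more bits.
        for cont in data[i + 1:i + length]:
            value = (value << 6) | (cont & 0x3F)
        code_points.append(value)
        i += length
    return code_points


def rows_used(text):
    """Count code points per 256-wide row (an approximation of blocks)."""
    return Counter(cp >> 8 for cp in decode_utf8(text.encode("utf-8")))


print(rows_used("hello world"))   # everything lands in row 0x00
print(rows_used("привет мир"))    # row 0x04 (Cyrillic) plus the space in 0x00
```

A decoder who sees only one or two rows per sample has a small, dense range to match against a candidate alphabet, which is what makes the phonetic scripts tractable.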
But then there are the CJK Unified Ideographs, where the characters actually in use are scattered essentially at random, because the ordering only makes sense if you already know which characters were encoded, and how many, at each point in the history of Unicode.
There are large swaths of Unicode which, if somehow totally lost, would essentially require finding font data or character reference tables to recover.