by naniwaduni
2389 days ago
This is a very restricted subset of UTF-8. I agree that the ASCII subset would not be tremendously difficult to decipher; the most interesting parts are laid out systematically and in order, and case is even just a bit flip. It's even fairly plausible that the UTF-8 numerical encoding could be reverse-engineered from a few samples; enough languages' text generally uses characters from few enough blocks to be identifiable. If you're really motivated, you can probably work your way through most of the languages with phonetic writing systems. But then there's CJK Unified Ideographs, where the characters that get used are scattered essentially randomly, because the ordering is only relevant if you already know how many and which characters were encoded at what point in the history of Unicode. There are large swaths of Unicode which, if somehow totally lost, would essentially require finding font data or character reference tables to recover.
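To make the "systematic" part concrete, here is a rough Python sketch of the two regularities mentioned above: ASCII case differing by a single bit, and the UTF-8 byte layout being simple enough to decode by hand. The function names `flip_case` and `decode_utf8_char` are made up for illustration, and the decoder deliberately skips validation (overlong forms, surrogates, truncated input).

```python
def flip_case(ch: str) -> str:
    """Toggle the case of an ASCII letter by flipping bit 0x20."""
    if not (ch.isascii() and ch.isalpha()):
        return ch  # digits, punctuation, non-ASCII: leave untouched
    return chr(ord(ch) ^ 0x20)


def decode_utf8_char(data: bytes) -> int:
    """Decode the first UTF-8 sequence in `data` to a code point.

    Illustrative only: no validation of continuation bytes or ranges.
    """
    first = data[0]
    if first < 0x80:               # 0xxxxxxx: plain ASCII
        return first
    if first >> 5 == 0b110:        # 110xxxxx: one continuation byte
        extra, cp = 1, first & 0x1F
    elif first >> 4 == 0b1110:     # 1110xxxx: two continuation bytes
        extra, cp = 2, first & 0x0F
    elif first >> 3 == 0b11110:    # 11110xxx: three continuation bytes
        extra, cp = 3, first & 0x07
    else:
        raise ValueError("not a valid leading byte")
    for b in data[1:1 + extra]:
        cp = (cp << 6) | (b & 0x3F)  # each 10xxxxxx byte adds 6 bits
    return cp


if __name__ == "__main__":
    print(flip_case("a"), flip_case("Z"))
    print(hex(decode_utf8_char("中".encode("utf-8"))))
```

Note that this only recovers the number. Knowing that a byte sequence means U+4E2D tells you nothing about what that character looks like or means, which is the hard part for the CJK block.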
Code breakers have decoded ciphertexts that used a code in which each word was replaced with a number. To make it even harder, common words would be replaced by more than one number to defeat common frequency analysis techniques. This was often done with pen and paper.
Yuri Knorozov managed to decipher the Mayan script. That was a significantly harder task than recovering UTF-8 mappings because he had very little to work with on the source language (he did have some things).
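For anyone unfamiliar with the word-to-number codes mentioned above, here is a toy Python sketch of the "more than one number per common word" idea. It is not any historical codebook: `build_codebook`, `encode`, and `decode` are hypothetical names, and giving each word exactly as many numbers as it has occurrences is an assumption chosen to make the frequency flattening obvious.

```python
import random
from collections import Counter


def build_codebook(words):
    """Give each word one number per occurrence, so common words get more."""
    codebook = {}
    next_code = 0
    for word, count in Counter(words).items():
        codebook[word] = list(range(next_code, next_code + count))
        next_code += count
    return codebook


def encode(words, codebook, rng):
    """Replace each word with a randomly chosen number from its set."""
    return [rng.choice(codebook[w]) for w in words]


def decode(codes, codebook):
    """Invert the codebook; every number maps back to exactly one word."""
    reverse = {c: w for w, cs in codebook.items() for c in cs}
    return [reverse[c] for c in codes]


if __name__ == "__main__":
    message = "the enemy will attack the fort at the river".split()
    book = build_codebook(message)
    secret = encode(message, book, random.Random())
    print(secret)
    print(" ".join(decode(secret, book)))
```

Because "the" can come out as any of three numbers, a simple count of the most frequent number no longer points straight at the most frequent word, which is what made these codes harder to break by hand.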