Hacker News new | ask | show | jobs
by naniwaduni 2389 days ago
This is a very restricted subset of utf-8. I agree that the ASCII subset would not be tremendously difficult to decipher; the most interesting parts are laid out systematically and in order and case is even just a bit flip.

It's even fairly plausible that the utf-8 numerical encoding can be reverse-engineered from a few samples; enough languages' text generally only use characters from few enough blocks to identify. If you're really motivated, you can probably work your way through most of the languages with phonetic writing systems.

But then there's CJK Unified Ideographs, where the characters that get used are scattered essentially randomly because the ordering is only relevant if you already know how many and which characters were encoded at what point in the history of Unicode.

There are large swaths of Unicode which, if somehow totally lost, would essentially require finding font data or character reference tables to recover.

1 comments

I agree recovering CJK Unified Ideographs encodings would be far harder than a phonetic alphabet, however a few things could make not as hard as it seems. The decoder has access to a text in both the future format and UTF-8. A text might mix phonetic words and ideographs as Japanese sometimes does today. The phonetic words would provide clues as to the ideographic characters.

Code breakers have decoded ciphertexts which used a code such that each word was replaced with a number. To make it even harder common words would be replaced by more than one numbers to defeat common frequency analysis techniques. This was done often with pen and paper.

Yuri Knorozov managed to decipher the Mayan script. That was a significantly harder task than recovering UTF-8 mappings because he has very little to work with on the source language (he did have somethings).

Exactly. You shouldn't underestimate the tremendous amount of work has been put into deciphering actual ancient languages using advanced techniques and minor contextual clues. Compared to that, deciphering most common UTF8 data would be relatively simple, meaning it could be done by a single person with some reverse engineering skills.