by yrlf 2219 days ago

This is a really good post that shines some light on how the insanity of encodings still isn't fixed today, since so many operating systems still don't completely use Unicode everywhere.

Some of the explanations of why the devices display the name the way they do are slightly incorrect, though, so here are some corrections:

For each example I'll include some python3 code to reproduce it, using the following definition:

`data = b"a\xcc\xb6\xcc\x81\xcc\x93\xcc\xbf\xcc\x88\xcc\x9b\xcc\x9b\xcd\x90\xcd\x98\xcd\x86\xcc\x90\xcd\x9d\xcc\x87\xcc\x92\xcc\x91\xcd"`

First, let's start at the beginning:

> My router just cut the name down to 32 octets though to stay complient

> This was what was being sent according to iw

> `a\xcc\xb6\xcc\x81\xcc\x93\xcc\xbf\xcc\x88\xcc\x9b\xcc\x9b\xcd\x90\xcd\x98\xcd\x86\xcc\x90\xcd\x9d\xcc\x87\xcc\x92\xcc\x91\xcd`

If you look at this closely, the last byte in this sequence is `\xcd`, which is the lead byte of a two-byte UTF-8 sequence, so the final character is incomplete. It's missing the final `\x84` that the router cut off (along with the three additional `a` characters).
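A quick way to check the truncation, as a minimal sketch (the names `error` and `completed` are just for illustration): strict UTF-8 decoding should reject the 32 octets at the final byte, and appending the missing `\x84` should make them decode cleanly.

```python
data = b"a\xcc\xb6\xcc\x81\xcc\x93\xcc\xbf\xcc\x88\xcc\x9b\xcc\x9b\xcd\x90\xcd\x98\xcd\x86\xcc\x90\xcd\x9d\xcc\x87\xcc\x92\xcc\x91\xcd"

error = None
try:
    data.decode("utf-8")  # strict error handling is the default
except UnicodeDecodeError as exc:
    error = exc

print(len(data))    # number of octets the router kept
print(error)        # reports the offending byte and its position

# Re-attach the continuation byte that was cut off.
completed = (data + b"\x84").decode("utf-8")
print(len(completed))  # code points: the "a" plus the combining marks
```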

> with the raw hex being

> `97ccb6cc81cc93ccbfcc88cc9bcc9bcd90cd98cd86cc90cd9dcc87cc92cc91cd`

Small mistake: the hex value of `a` is `61`, not `97` (that's decimal). Otherwise this is correct.
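To see the decimal/hex mix-up directly, here is a small sketch (`quoted_hex` is the string from the post, `actual_hex` is what python3 produces for the same bytes):

```python
data = b"a\xcc\xb6\xcc\x81\xcc\x93\xcc\xbf\xcc\x88\xcc\x9b\xcc\x9b\xcd\x90\xcd\x98\xcd\x86\xcc\x90\xcd\x9d\xcc\x87\xcc\x92\xcc\x91\xcd"

quoted_hex = "97ccb6cc81cc93ccbfcc88cc9bcc9bcd90cd98cd86cc90cd9dcc87cc92cc91cd"
actual_hex = data.hex()

print(ord("a"))       # code point of "a" in decimal
print(hex(ord("a")))  # the same value in hex
print(actual_hex)

# Only the first byte differs between the two strings.
print(actual_hex[2:] == quoted_hex[2:])
```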

> Galaxy S8 running Android 9 with Kernel 4.4.153

> Amazon Firestick

Everything correct, except for a small detail:

These two devices render the result of UTF-8 decoding while ignoring bytes that are not valid UTF-8 (in python3: `data.decode('utf-8', 'ignore')`), so the truncated trailing `\xcd` is silently dropped.
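As a sketch of what that decoding yields in python3 (this only models the decoding step, not how the devices actually render the marks):

```python
data = b"a\xcc\xb6\xcc\x81\xcc\x93\xcc\xbf\xcc\x88\xcc\x9b\xcc\x9b\xcd\x90\xcd\x98\xcd\x86\xcc\x90\xcd\x9d\xcc\x87\xcc\x92\xcc\x91\xcd"

# "ignore" drops the dangling lead byte instead of raising an error.
text = data.decode("utf-8", "ignore")

print(len(text))
print([f"U+{ord(c):04X}" for c in text])
```

Everything after the `a` lands in the "Combining Diacritical Marks" block (U+0300 to U+036F), which is why the name shows up as a single heavily decorated letter.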

> iPhone 6 running iOS 13.5.1

> Apple TV Second Generation

Completely correct. This is definitely Mac OS Roman (in python3: `data.decode('mac_roman')`)
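A small sketch of the Mac OS Roman case: Python's `mac_roman` codec maps every byte value to some character, so nothing gets dropped and all 32 octets turn into 32 visible characters.

```python
data = b"a\xcc\xb6\xcc\x81\xcc\x93\xcc\xbf\xcc\x88\xcc\x9b\xcc\x9b\xcd\x90\xcd\x98\xcd\x86\xcc\x90\xcd\x9d\xcc\x87\xcc\x92\xcc\x91\xcd"

text = data.decode("mac_roman")

print(len(text))
print(text)

# Encoding the result again gives back the original bytes.
print(text.encode("mac_roman") == data)
```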

> Windows 10 Pro 10.0.19041

This one is incorrect again:

Windows is interpreting the characters in the "Windows Codepage 1252" (also known as "Western") encoding and ignoring invalid characters (in python3: `data.decode('cp1252', 'ignore')`)
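Here is a sketch of the cp1252 case using Python's codec (an approximation, since Python's `cp1252` table is not guaranteed to match Windows's own conversion for the unassigned bytes). Strict decoding fails at the first unassigned byte, and `'ignore'` drops those bytes instead:

```python
data = b"a\xcc\xb6\xcc\x81\xcc\x93\xcc\xbf\xcc\x88\xcc\x9b\xcc\x9b\xcd\x90\xcd\x98\xcd\x86\xcc\x90\xcd\x9d\xcc\x87\xcc\x92\xcc\x91\xcd"

strict_error = None
try:
    data.decode("cp1252")
except UnicodeDecodeError as exc:
    strict_error = exc

text = data.decode("cp1252", "ignore")
dropped = len(data) - len(text)

print(strict_error)  # the first byte cp1252 has no character for
print(text)
print(dropped)       # how many bytes were silently skipped
```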

Decoding every byte separately as UTF-8 would fail (a lead byte is incomplete on its own, and a byte that can be a continuation of a UTF-8 character is not a valid start byte).
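This is easy to confirm with a short loop (a sketch; `failures` is just a counter introduced for the example):

```python
data = b"a\xcc\xb6\xcc\x81\xcc\x93\xcc\xbf\xcc\x88\xcc\x9b\xcc\x9b\xcd\x90\xcd\x98\xcd\x86\xcc\x90\xcd\x9d\xcc\x87\xcc\x92\xcc\x91\xcd"

failures = 0
for value in data:
    try:
        # Decode one byte at a time, in strict mode.
        bytes([value]).decode("utf-8")
    except UnicodeDecodeError:
        failures += 1

print(failures, "of", len(data), "bytes fail on their own")
```

Only the ASCII `a` survives, so per-byte UTF-8 decoding cannot produce what Windows shows.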

Interpreting every byte as a Unicode code point number (which is what Latin-1 decoding does) would give something very similar, but not exactly the same: the bytes that Windows decodes as quote, caret-y thing, angle bracket-y thing, tilde, dagger, double dagger, and single quote fall into the C1 control character range at the start of the Unicode "Latin-1 Supplement" block (`\x80` to `\x9f`).
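To compare the two interpretations, here is a sketch that uses `latin-1` as the "byte value equals code point" decoding and prints where it disagrees with Python's `cp1252` table:

```python
import unicodedata

data = b"a\xcc\xb6\xcc\x81\xcc\x93\xcc\xbf\xcc\x88\xcc\x9b\xcc\x9b\xcd\x90\xcd\x98\xcd\x86\xcc\x90\xcd\x9d\xcc\x87\xcc\x92\xcc\x91\xcd"

# Latin-1 maps byte N straight to code point U+00NN.
latin1 = data.decode("latin-1")

# Characters with general category "Cc" are control characters.
controls = [c for c in latin1 if unicodedata.category(c) == "Cc"]
print(len(controls), "control characters out of", len(latin1))

# Bytes in the \x80 to \x9f range are where the two decodings differ.
for value in sorted(set(data)):
    if 0x80 <= value <= 0x9F:
        as_cp1252 = bytes([value]).decode("cp1252", "replace")
        print(f"{value:#04x}: latin-1 gives U+{value:04X}, cp1252 gives {as_cp1252!r}")
```

Bytes that Python's `cp1252` table leaves unassigned show up as U+FFFD in the right-hand column because of the `'replace'` error handler.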

> Chromebook running ChromeOS 83.0.4103.97

Correct.

The Chromebook seems to have rendered the ASCII `a`, but replaced all other 31 bytes with question marks.
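A sketch that reproduces that output, assuming the fallback is simply "keep ASCII, replace every other byte with `?`" (I don't know what ChromeOS actually does internally):

```python
data = b"a\xcc\xb6\xcc\x81\xcc\x93\xcc\xbf\xcc\x88\xcc\x9b\xcc\x9b\xcd\x90\xcd\x98\xcd\x86\xcc\x90\xcd\x9d\xcc\x87\xcc\x92\xcc\x91\xcd"

# Assumed behaviour: bytes below 0x80 are shown as-is, the rest become "?".
rendered = "".join(chr(value) if value < 0x80 else "?" for value in data)

print(rendered)
print(rendered.count("?"))
```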

> Kindle Paperwhite running Firmware 5.10.2

> Vizio M55-C2 TV

Also correct.

Those two devices seem to opt to display hex instead of falling back to question marks as the Chromebook does.
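Something similar can be sketched in python3 with the `backslashreplace` error handler. The `\xNN` notation here is Python's own, so the exact format on the Kindle and the TV may well differ:

```python
data = b"a\xcc\xb6\xcc\x81\xcc\x93\xcc\xbf\xcc\x88\xcc\x9b\xcc\x9b\xcd\x90\xcd\x98\xcd\x86\xcc\x90\xcd\x9d\xcc\x87\xcc\x92\xcc\x91\xcd"

# Bytes that are not ASCII are shown as hex escapes instead of being replaced.
rendered = data.decode("ascii", "backslashreplace")

print(rendered)
```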

I hope this comment gave some useful insight into why these devices decoded it this way :)

herohamp 2219 days ago

Hey, I am the OP. Thank you so much! I will go through and amend what I got wrong. Is there any way you would like me to credit you?

yrlf 2219 days ago

If you want to credit me, just tag my twitter :)

(@theFerdi265)
