by keithgabryelski 3390 days ago
Here is my take on this subject:

A Practical Guide to Character Sets and Encodings or: What’s all this about ASCII, Unicode and UTF-8?

https://medium.com/@keithgabryelski/a-practical-guide-to-cha...

2 comments

>Character Sets: a collection of characters associated with numeric values. These pairings are called “code points”.

This is a very ambiguous definition, and it can be very confusing. I'm sure there are many people who have read Joel Spolsky's Unicode intro and left confused.

Using ASCII as an example is confusing because an ASCII character maps onto several different Unicode concepts:

1. byte

2. code point

3. encoded character

4. grapheme

5. grapheme cluster

6. abstract character

7. user perceived character

The mapping from user-perceived characters to abstract characters is not total, injective, or surjective. Some abstract characters need more than one code point to express them. You can't split a sequence of Unicode code points at arbitrary code point boundaries; you must split at grapheme cluster boundaries instead.
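To make the splitting problem concrete, here is a minimal Python sketch. The helper `simple_clusters` is a hypothetical name, and it is only a rough stand-in for real grapheme cluster segmentation (UAX #29): it groups combining marks with their base code point and nothing more, so it does not handle ZWJ emoji sequences, Hangul jamo, or regional indicators.

```python
import unicodedata


def simple_clusters(text):
    """Group each code point with the combining marks that follow it.

    Simplified approximation only, NOT full UAX #29 segmentation.
    """
    clusters = []
    for ch in text:
        # unicodedata.combining() returns a non-zero class for combining marks.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# "café" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT.
word = "cafe\u0301"

code_point_count = len(word)        # Python's len() counts code points
clusters = simple_clusters(word)    # the accent stays attached to its 'e'

naive_cut = word[:4]                # cuts between 'e' and its accent
cluster_cut = "".join(clusters[:4]) # cuts on a cluster boundary instead

print(code_point_count, len(clusters))
print(repr(naive_cut), repr(cluster_cut))
```

The naive slice silently drops the accent, while the cluster-based slice keeps the last user-perceived character whole. For real text, a library that implements the full segmentation rules would be the safer choice.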

> 1. byte

Actually 'code unit', which may have any number of bits, depending on the encoding. Otherwise spot on.
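As a rough illustration of how the code unit size depends on the encoding, this Python sketch counts code units for the same text under UTF-8 (8-bit units), UTF-16 (16-bit units), and UTF-32 (32-bit units). The function name `code_unit_counts` is made up for this example.

```python
def code_unit_counts(text):
    """Return the number of code units the text needs in each encoding."""
    return {
        # The "-le" variants are used so no byte order mark is prepended.
        "utf-8": len(text.encode("utf-8")),           # 1 byte per unit
        "utf-16": len(text.encode("utf-16-le")) // 2,  # 2 bytes per unit
        "utf-32": len(text.encode("utf-32-le")) // 4,  # 4 bytes per unit
    }


counts_ascii = code_unit_counts("A")           # U+0041
counts_euro = code_unit_counts("\u20ac")       # U+20AC EURO SIGN
counts_emoji = code_unit_counts("\U0001F600")  # U+1F600, outside the BMP

print(counts_ascii)
print(counts_euro)
print(counts_emoji)
```

Each input here is a single code point, yet the number of code units varies from one to four depending on the encoding.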

it's a practical guide, not comprehensive -- for most people this is a great start, especially if they are familiar with ASCII

you'll notice I didn't cover collation -- why? because explaining that would dilute the process of understanding UTF-x and Unicode

It's pedagogically wrong and extremely misleading.

When people start with introduction like this, they end up thinking they have learned more than they actually have.

I point this out because I was one of those misled by several previous articles explaining Unicode encoding the same wrong way you did. When I ran into trouble and asked for help, everybody around me was misguided in the same way. I didn't know I had to dig into the manuals because everybody explained that this is how it is. Then I had to teach everyone else that we had all learned it wrong.

Many people deal only with ASCII or other easy Western alphabets, and they can work for years with Unicode before they run into trouble.

Please don't call code points "characters". This is wrong and/or confusing.

http://manishearth.github.io/blog/2017/01/14/stop-ascribing-...

The problem is mostly caused by how programming languages present text strings. There's usually a String class, with methods that say they manipulate characters. Usually, though, they either manipulate bytes, or manipulate code points, and it's often not clear which.

Which is usually a symptom of a deeper problem: using the same ADT to represent "known-valid decoded text" and "a slice of bytes that would maybe decode to text in some unspecified encoding", such that the methods manipulating that ADT are completely incoherent.

Honestly, I'm surprised we don't see more programming languages like Objective-C, that have very clear distinctions between their "Data" type and "String" type, where encoded text is an NSData (a buffer of bytes) while decoded, valid text is an NSString, and all the methods on NSStrings operate on the grapheme clusters that decoded text is composed of, rather than on the bytes or codepoints or "characters" that are only relevant to encoded text.
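Python 3 draws a similar line between encoded and decoded text, which makes for a short sketch of the idea: `bytes` holds encoded data, `str` holds decoded text, and crossing the boundary requires naming an encoding. The helper `decode_or_none` is a hypothetical name for this example. Note that Python's `str` methods work on code points, not grapheme clusters, so this only illustrates the type separation.

```python
def decode_or_none(raw, encoding="utf-8"):
    """Return decoded text, or None if the bytes are not valid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


encoded = "na\u00efve".encode("utf-8")  # bytes: encoded form of "naïve"
decoded = decode_or_none(encoded)       # str: known-valid decoded text

byte_length = len(encoded)  # counts bytes
text_length = len(decoded)  # counts code points

# 0xFF can never appear in well-formed UTF-8, so this buffer is not text.
not_text = decode_or_none(b"\xff\xfe\xfa")

print(byte_length, text_length, not_text)
```

Because the two types do not mix implicitly, a buffer that "would maybe decode to text" has to be validated once, at the boundary, before any text operation can touch it.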

Swift is especially good at this because almost all string ops are high level. Splitting is on EGCs (inherited from objc no doubt), and equality is normalized equality. You need to explicitly ask for other operations.
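For comparison, here is roughly what normalized equality looks like when you have to ask for it explicitly, sketched in Python rather than Swift. The function name `canonically_equal` is invented for this example, and using NFC is an assumption about what "normalized" should mean here (canonical equivalence).

```python
import unicodedata


def canonically_equal(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "\u00e9"     # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

raw_equal = composed == decomposed  # code point by code point comparison
normalized_equal = canonically_equal(composed, decomposed)

print(raw_equal, normalized_equal)
```

Both strings render as the same user-perceived character, but only the normalized comparison treats them as equal.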