by nabla9 3400 days ago
>Character Sets: a collection of characters associated with numeric values. These pairings are called “code points”.

This is a very ambiguous definition and it can be very confusing. I'm sure there are many people who have read Joel Spolsky's Unicode intro and left confused.

Using ASCII as an example is confusing because an ASCII character maps onto several different Unicode concepts:

1. byte

2. code point

3. encoded character

4. grapheme

5. grapheme cluster

6. abstract character

7. user perceived character

The mapping from user-perceived characters to abstract characters is not total, injective, or surjective. Some abstract characters need more than one code point to express them. You can't split a sequence of Unicode code points at arbitrary code point boundaries; you must split at grapheme cluster boundaries instead.
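To make the splitting problem concrete, here is a minimal Python sketch. The helper `crude_clusters` is a hypothetical name and only a rough approximation: it glues combining marks onto the preceding code point, whereas real grapheme cluster segmentation follows the rules in UAX #29 and handles many more cases.

```python
import unicodedata

def crude_clusters(text):
    """Group each code point with the combining marks that follow it.

    Rough approximation for illustration only; it is not a UAX #29
    implementation and misses cases such as emoji sequences.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch  # attach the mark to the previous cluster
        else:
            clusters.append(ch)
    return clusters

# "café" spelled as 'e' followed by U+0301 COMBINING ACUTE ACCENT
word = "cafe\u0301"

code_points = len(word)                 # Python str length counts code points
utf8_bytes = len(word.encode("utf-8"))  # length of the UTF-8 encoding in bytes
clusters = crude_clusters(word)

# Splitting at a code point boundary strands the accent on its own
head, tail = word[:4], word[4:]

print(code_points, utf8_bytes, len(clusters))
print(repr(head), repr(tail))
```

The same user-perceived word has one count in code points, another in bytes, and a third in clusters, and the naive slice leaves a bare combining accent in `tail`.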

2 comments

> 1. byte

Actually 'code unit', which may have any number of bits, depending on the encoding. Otherwise spot on.
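A small Python sketch of the code unit point, assuming the standard `utf-8`, `utf-16-le` and `utf-32-le` codecs (the `-le` variants are used so no byte order mark is counted). The helper name `code_units` is made up for illustration.

```python
# Width of one code unit in bytes for each encoding form
UNIT_WIDTH = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}

def code_units(text, encoding):
    """Return how many code units `text` occupies in `encoding`."""
    return len(text.encode(encoding)) // UNIT_WIDTH[encoding]

euro = "\u20ac"       # EURO SIGN, inside the Basic Multilingual Plane
emoji = "\U0001f600"  # GRINNING FACE, outside the Basic Multilingual Plane

for label, ch in (("euro", euro), ("emoji", emoji)):
    counts = {enc: code_units(ch, enc) for enc in UNIT_WIDTH}
    print(label, counts)
```

Each string here is a single code point, yet the number of code units differs per encoding, which is why "byte" is only the right word for UTF-8.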

it's a practical guide, not comprehensive -- for most people this is a great start, especially if they are familiar with ASCII

you'll notice I didn't cover collation -- why? because explaining that would dilute the process of understanding UTF-x and Unicode

It's pedagogically wrong and extremely misleading.

When people start with an introduction like this, they end up thinking they have learned more than they actually have.

I point this out because I was one of those misled by several previous articles explaining Unicode encoding the same wrong way as you did. When I ran into trouble and asked for help, everybody around me was misguided the same way. I didn't know I had to dig into the manuals because everybody explained that this is how it is. Then I had to teach everyone else that we had all learned it wrong.

Many people deal only with ASCII or other easy Western alphabets, and they can work for years with Unicode before they run into trouble.