by nabla9
3400 days ago
>Character Sets: a collection of characters associated with numeric values. These pairings are called “code points”.

This is a very ambiguous definition, and it can be very confusing. I'm sure there are many people who have read Joel Spolsky's Unicode intro and left confused.

Using ASCII as an example is confusing because an ASCII character maps onto several different Unicode concepts:

1. byte
2. code point
3. encoded character
4. grapheme
5. grapheme cluster
6. abstract character
7. user-perceived character

The mapping from user-perceived characters to abstract characters is not total, injective, or surjective. Some abstract characters need more than one code point to express them. You can't split a sequence of Unicode code points at arbitrary code point boundaries; you must split at grapheme cluster boundaries instead.
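To make the splitting problem concrete, here is a minimal Python sketch. It assumes Python 3, where `str` is indexed by code point; `naive_truncate` is a hypothetical helper made up for illustration, not a library function.

```python
import unicodedata

# "é" written as a base letter plus a combining accent:
# two code points, but one user-perceived character.
decomposed = "e\u0301"

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)


def naive_truncate(text: str, n: int) -> str:
    """Cut after n code points, ignoring grapheme cluster boundaries."""
    return text[:n]


print(len(decomposed))                 # 2 code points
print(len(composed))                   # 1 code point
print(naive_truncate(decomposed, 1))   # "e": the accent was cut off
```

Normalization only hides the problem for characters that happen to have a precomposed form. Splitting correctly in general needs grapheme cluster segmentation, which the Python standard library does not provide as far as I know.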
Actually 'code unit', which may have any number of bits, depending on the encoding. Otherwise spot on.
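A rough Python sketch of the code unit point, assuming the three standard UTF encodings; `code_unit_count` is a hypothetical helper written for this example. The same single code point takes a different number of code units depending on the encoding.

```python
def code_unit_count(text: str, encoding: str, unit_bytes: int) -> int:
    """Number of code units in text for an encoding with fixed-size units."""
    return len(text.encode(encoding)) // unit_bytes


ch = "\U0001F600"  # a single code point outside the Basic Multilingual Plane

# The "-le" codec names avoid a byte order mark being prepended.
utf8_units = code_unit_count(ch, "utf-8", 1)       # 8-bit code units
utf16_units = code_unit_count(ch, "utf-16-le", 2)  # 16-bit code units
utf32_units = code_unit_count(ch, "utf-32-le", 4)  # 32-bit code units

print(utf8_units, utf16_units, utf32_units)  # 4 2 1
```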