by est
4953 days ago
> Two code points, one letter.
Yes, I understand there are a million ways to display the same shape using various Unicode code points. But how does that make code point counting impossible? And if you explicitly use COMBINING DIAERESIS instead of the single U+00C4, is counting the diaeresis separately somehow wrong? Why don't we make a law stating that both ae and æ are a single letter?
Just in case you don't, let's walk through it again.
UCS-2 big-endian representation of Ä:
0x00 0x41 0x03 0x08
Another UCS-2 big-endian representation of Ä:
0x00 0xc4
If you look at the number of bytes, the first example has 4. It represents one letter. The second example has 2. It also represents one letter. Conclusion: UCS-2 does not "count unicode characters faster than UTF8." You still have to look at every byte to see how many letters you have, same as in UTF-8.
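If you want to check the byte counts yourself, here is a rough Python sketch. The helper names `ucs2_be` and `count_letters` are made up for illustration, and it assumes that Python's `utf-16-be` codec matches UCS-2 big-endian for these code points, which holds because both are in the BMP. Skipping combining marks is only an approximation of "letters"; real grapheme segmentation has more rules than this.

```python
import unicodedata

decomposed = "A\u0308"   # LATIN CAPITAL LETTER A + COMBINING DIAERESIS
precomposed = "\u00c4"   # LATIN CAPITAL LETTER A WITH DIAERESIS


def ucs2_be(s: str) -> bytes:
    # Hypothetical helper: for BMP-only text, UTF-16BE and UCS-2 BE
    # produce the same bytes.
    return s.encode("utf-16-be")


def count_letters(s: str) -> int:
    # Hypothetical helper: count code points that are not combining marks.
    # This is a simplification, not full grapheme cluster segmentation.
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


for s in (decomposed, precomposed):
    data = ucs2_be(s)
    print(data.hex(" "), "bytes:", len(data),
          "code points:", len(s), "letters:", count_letters(s))
```

Both strings come out as one letter, but the byte count and the code point count differ, so neither number alone tells you how many letters you have. Normalizing to NFC with `unicodedata.normalize("NFC", decomposed)` collapses this particular pair to U+00C4, but not every base-plus-mark sequence has a precomposed form.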
Do you grasp this? If not, maybe you are one of those "ascii-centric ignorant morons" I keep hearing so much about.