by kbolino
291 days ago
To implement grapheme cluster segmentation, you have to start with a sequence of Unicode scalar values. In practice, that means a sequence of 32-bit integers, which is UTF-32 in all but name. It's not a good interchange format, but it is a necessary intermediate/internal format. There's also the problem that the grapheme cluster boundary rules change between Unicode versions. Unicode has become a true mess.
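A rough sketch of what I mean, in Python. The helper names `scalars` and `toy_clusters` are made up for illustration, and the clustering rule here is deliberately simplified (it only attaches combining marks to the preceding character); it is not the real UAX #29 algorithm, which also has to handle Hangul, emoji ZWJ sequences, regional indicators, and so on. The point is just that the segmentation step works on integers, not on UTF-8 bytes.

```python
import unicodedata


def scalars(text: str) -> list[int]:
    """Return the code points of text as plain integers (the UTF-32-like form)."""
    return [ord(ch) for ch in text]


def toy_clusters(text: str) -> list[str]:
    """Group each character with the combining marks that follow it.

    Simplified on purpose: this is NOT full UAX #29 segmentation.
    """
    clusters: list[str] = []
    for ch in text:
        # A nonzero canonical combining class means "attach to what came before".
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


sample = "e\u0301a"  # 'e' + COMBINING ACUTE ACCENT, then 'a'
print(len(sample.encode("utf-8")))   # size as an interchange format (bytes)
print(scalars(sample))               # the intermediate integer sequence
print(toy_clusters(sample))          # clusters built from those integers
print(unicodedata.unidata_version)   # results depend on this data version
```

The last line is the other half of the complaint: the answer you get is tied to whichever version of the Unicode data your runtime happens to ship.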