|
If you think the Unicode flag emoji take a lot of bytes, then consider the family emoji! (https://unicode.org/emoji/charts/full-emoji-list.html#family) I'm in the process of designing a scripting language and implementing it in C++. I plan to put together a YouTube series about it. (Doesn't everyone want to see Bison and Flex mixed with proper unit tests and C++20 code?) Because of my intended use case, I needed good support for Unicode. I thought that I could write it myself, and I was wrong. I wasted two weeks (in my spare time, mostly evenings) trying to cobble together things that should work, identifying patterns, figuring out how to update it as Unicode itself is updated, thinking about edge cases, i18n, zalgo text, etc. And then I finally reached the point where I knew enough to know that I was making the wrong choice. I'm now using ICU. (https://icu.unicode.org/) It's huge, it was hard to get it working in my environment, and there are very few examples of its usage online, but after the initial setup dues are paid, it WORKS. Aside: Yes, I know I'm crazy for implementing a programming language that I intend for serious usage. Yes, I have good reasons for doing it, and yes, I have considered alternatives. But it's fun, so I'm doing it anyway. Moral of the story: Dealing with Unicode is hard, and if you think it shouldn't be that hard, then you probably don't know enough about the problem! |
- Counting, rendering and collapsing grapheme clusters (like the flag emoji)
- Converting between legacy encodings (Shift JIS, KOI8, etc.) and UTF-8 / UTF-16
- Canonicalization
If all you need is to deal with UTF-8 byte buffers, you don't need any of that stuff, and your code can stay simple, small and fast.
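As a rough illustration of how far plain byte handling can go, here is a minimal sketch in Rust. The helper name split_on_ascii is made up for this example. It relies on one property of UTF-8: every byte of a multi-byte sequence is 0x80 or higher, so an ASCII delimiter can never show up in the middle of an encoded character.

```rust
/// Split a UTF-8 byte buffer on an ASCII delimiter without decoding it.
/// (Hypothetical helper for illustration; not from any library.)
///
/// This is safe because all lead and continuation bytes of multi-byte
/// UTF-8 sequences are >= 0x80, so a byte < 0x80 is always a complete
/// ASCII character on its own.
pub fn split_on_ascii(buf: &[u8], delim: u8) -> Vec<&[u8]> {
    assert!(delim.is_ascii(), "delimiter must be a single ASCII byte");
    buf.split(|&b| b == delim).collect()
}
```

If the input was valid UTF-8, each returned piece is valid UTF-8 too, so something like a CSV-ish splitter never has to know about code points, let alone grapheme clusters.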
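IIRC the Rust standard library doesn't bother supporting any of the hard parts of Unicode. The Unicode support in std is mostly UTF-8 validation for strings, plus code point iteration and some basic char properties and case mapping. All the complex aspects of Unicode (grapheme segmentation, normalization, legacy encodings) are delegated to 3rd party crates.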
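For a concrete picture of what std covers, here is a small sketch using only the standard library. The function names validate and sizes, and the two constants, are made up for this example. It validates a byte buffer and reports byte and code point counts; it deliberately does not count grapheme clusters, since std has no API for that and you would need a crate such as unicode-segmentation.

```rust
/// Regional indicators D + E, rendered as a single flag.
pub const FLAG: &str = "\u{1F1E9}\u{1F1EA}";
/// Man, woman, girl, boy joined by zero-width joiners (U+200D),
/// rendered as a single family emoji.
pub const FAMILY: &str =
    "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}";

/// Validate a byte buffer as UTF-8 (hypothetical helper).
/// On failure, returns the number of leading bytes that were valid.
pub fn validate(buf: &[u8]) -> Result<&str, usize> {
    std::str::from_utf8(buf).map_err(|e| e.valid_up_to())
}

/// Returns (length in bytes, number of Unicode code points).
/// Neither number is the count of user-perceived characters.
pub fn sizes(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}
```

With these definitions, sizes(FLAG) is (8, 2) and sizes(FAMILY) is (25, 7), even though each string displays as one glyph. That gap between code points and what the user sees is exactly the part std leaves to other crates.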
By contrast, Node.js (and web browsers) do all of this. But they implement it the same way you're suggesting: they simply call out to libicu.