by johnboyer 2881 days ago
Unicode is supported through UTF-8; only other character types aren't supported, because generics are messy in C. I thought about accepting void* to support wchar_t and others, but that came with performance penalties, so I decided against it.

No, having different character types (I believe you are referring to C11's `char16_t` and `char32_t`?) is not a requirement for Unicode support. At the very least you need to have a single function or two that...

* Receives a string expected to be encoded in UTF-8, and an offset to it expected to be a UTF-8 sequence boundary.

* Scans forward or backward for the next or previous UTF-8 sequence boundary.

* Optionally returns the code point for the scanned UTF-8 sequence.

* Has proper error handling for every imaginable case: out of bounds, not a boundary, not a valid UTF-8 sequence. (The out-of-bounds case needs to be handled because it will be the end condition of the iteration. Preferably it should be distinct from the other error conditions.)

Every other functionality can build upon this little function; in particular, iteration and UTF-8 validation become trivial. Full Unicode support, including case mapping, folding, normalization and property lookup, will of course require a not-so-small table, but is not strictly necessary anyway.
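
To make the list above concrete, here is a minimal sketch of such a function in C, plus a validator built on top of it. The names (`utf8_next`, `utf8_is_valid`, the `utf8_status` codes) are hypothetical and not taken from the OP's library or any other. It is a plain branch-based decoder rather than the DFA approach, and it only scans forward; a backward scan would need a companion function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes; the names are illustrative only. */
enum utf8_status {
    UTF8_OK,            /* one sequence decoded, *off advanced past it */
    UTF8_END,           /* *off is at or past len: normal end of iteration */
    UTF8_NOT_BOUNDARY,  /* *off points at a continuation byte */
    UTF8_INVALID        /* bad lead byte, truncated, overlong, surrogate,
                           or above U+10FFFF */
};

/* Decode the sequence starting at s[*off]. On UTF8_OK, advance *off to the
 * next boundary and store the code point in *cp (cp may be NULL). On any
 * other status, *off and *cp are left untouched. */
enum utf8_status utf8_next(const unsigned char *s, size_t len,
                           size_t *off, uint32_t *cp)
{
    assert(off != NULL && (s != NULL || len == 0));
    size_t i = *off;
    if (i >= len)
        return UTF8_END;

    unsigned char lead = s[i];
    uint32_t c, min;   /* min = smallest code point allowed for this length */
    size_t need;       /* number of continuation bytes expected */

    if (lead < 0x80)      { c = lead;        need = 0; min = 0; }
    else if (lead < 0xC0) { return UTF8_NOT_BOUNDARY; }
    else if (lead < 0xE0) { c = lead & 0x1F; need = 1; min = 0x80; }
    else if (lead < 0xF0) { c = lead & 0x0F; need = 2; min = 0x800; }
    else if (lead < 0xF8) { c = lead & 0x07; need = 3; min = 0x10000; }
    else                  { return UTF8_INVALID; }

    if (need > len - i - 1)
        return UTF8_INVALID;             /* sequence runs past the end */

    for (size_t k = 1; k <= need; k++) {
        unsigned char t = s[i + k];
        if ((t & 0xC0) != 0x80)
            return UTF8_INVALID;         /* not a continuation byte */
        c = (c << 6) | (t & 0x3F);
    }
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return UTF8_INVALID;             /* overlong, too large, or surrogate */

    if (cp != NULL)
        *cp = c;
    *off = i + 1 + need;
    return UTF8_OK;
}

/* Validation is a loop that must stop on UTF8_END and nothing else. */
int utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t off = 0;
    enum utf8_status st;
    while ((st = utf8_next(s, len, &off, NULL)) == UTF8_OK)
        ;
    return st == UTF8_END;
}
```

Keeping `UTF8_END` separate from the error codes is what lets the validation loop tell "ran out of input" apart from "hit garbage". Iteration over code points is the same loop with a non-NULL `cp`.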

Björn Höhrmann's Flexible and Economical UTF-8 Decoder [1] will be handy for a concise implementation.

[1] https://bjoern.hoehrmann.de/utf-8/decoder/dfa/

I don't see any functions in the OP's library that would require dedicated UTF-8 handling. The string length is given in bytes, not characters or code points. There's no functionality to give you the character at the n-th location, etc. You can easily implement all Unicode-specific functionality in a separate library and use it together with the OP's library. IMHO that's even preferable.
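
As a rough illustration of the bytes-versus-code-points distinction, and of how small a separate layer can be, here is a sketch of a code point counter that works on any NUL-terminated byte string. The function name is hypothetical, and it assumes the input is already valid UTF-8; it does no validation of its own.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count code points in a NUL-terminated string that is
 * ASSUMED to be valid UTF-8. Every byte that is not a continuation byte
 * (10xxxxxx) starts a new code point. */
size_t utf8_count_code_points(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For `"h\xC3\xA9llo"` ("héllo"), the byte length is 6 while this counter returns 5. Note that code points are still not user-perceived characters; grapheme clusters need the Unicode tables mentioned upthread.
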

Yes, but don't call it a string library then. Strings should handle strings, and strings are Unicode now. Unicode needs to be normalized and needs case-insensitive support.

And it's not easy. I implemented the third library of its kind. First there was ICU, which is overly bloated; you don't need 30 MB for a simple string libc. Then there is libunistring, which has overly slow iterators, so it's not usable for coreutils. And then there's my safelibc, which is small and fast, but only for wide chars, not UTF-8.

I fixed and updated the musl case-mapping, making it 2x faster, but this is not in yet. And there's not even a properly spec'ed wcscmp/wcsicmp to find strings. glibc is an overall mess. I won't touch that. wcsicmp/wcsfc/wcsnorm are not even in POSIX.

How does the utf8proc[1] library that Julia uses compare to these?

[1] http://juliastrings.github.io/utf8proc/doc/

Why try to redefine the word "string"?

In computer jargon I believe CISC and the PDP-11 have seniority. That's why all multi-word functions like memcpy are in C's string.h header.

Hey, even C contains a locale-dependent string comparison, namely `strcoll` (since 1990!).
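
For reference, a small sketch of what that looks like in use: sorting an array of strings with `strcoll` as the comparator. `setlocale`, `strcoll` and `qsort` are standard C; the helper name `sort_with_locale` is made up for this example. Which locale names exist (beyond `"C"`) depends on the system, and `setlocale` changes process-wide state, so treat this as an illustration only.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: each element is a pointer to a C string. */
static int compare_by_collation(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcoll(*sa, *sb);   /* ordering depends on LC_COLLATE */
}

/* Hypothetical helper: sort v[0..n-1] using the collation rules of
 * locale_name. Returns 1 on success, 0 if the locale is unavailable
 * (in which case the array is left unsorted). */
int sort_with_locale(const char **v, size_t n, const char *locale_name)
{
    assert(v != NULL || n == 0);
    if (setlocale(LC_COLLATE, locale_name) == NULL)
        return 0;
    if (n > 1)
        qsort(v, n, sizeof *v, compare_by_collation);
    return 1;
}
```

In the `"C"` locale this gives the same order as `strcmp`; with a language-specific locale installed, accented and mixed-case strings may sort differently.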

I admit the two words "string" and "text" are now interchangeable. But that doesn't mean strings have fewer requirements; people are just expecting more out of strings.