by johnboyer
2881 days ago
Unicode is supported through UTF-8; only other character types aren't supported, because generics are messy in C. I considered accepting void* to support wchar_t and others, but that came with performance penalties, so I decided against it.
|
* Receives a string expected to be encoded in UTF-8, and an offset into it expected to be a UTF-8 sequence boundary.
* Scans forward or backward for the next or previous UTF-8 sequence boundary.
* Optionally returns the code point for the scanned UTF-8 sequence.
* Has proper error handling for every imaginable case: out of bounds, not a boundary, not a valid UTF-8 sequence. (The out-of-bounds case needs to be handled because it is the end condition of the iteration; preferably it should be distinct from the other error conditions.)
Everything else can be built on this little function; in particular, iteration and UTF-8 validation become trivial. Full Unicode support, including case mapping, folding, normalization and property lookup, will of course require a not-so-small table, but it is not strictly necessary anyway.
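To make the idea concrete, here is a rough sketch of what such a function could look like, with validation built on top of it. The names (utf8_next, utf8_is_valid, the status enum) are made up for illustration and are not from any existing library. It only scans forward, and it uses a plain branchy decoder rather than a table-driven DFA, so treat it as a starting point under those assumptions rather than a finished implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes; the end-of-string case is kept distinct
 * from the real errors so it can serve as the loop's end condition. */
enum utf8_status {
    UTF8_OK = 0,
    UTF8_END,          /* offset is at or past the end of the string */
    UTF8_NOT_BOUNDARY, /* offset points at a continuation byte */
    UTF8_INVALID       /* malformed, truncated, overlong, surrogate, or > U+10FFFF */
};

/* Decode the sequence starting at s[*off]. On UTF8_OK, *off is advanced
 * to the next boundary and, if cp is non-NULL, *cp receives the code point.
 * On any other status, *off and *cp are left untouched. */
static enum utf8_status utf8_next(const unsigned char *s, size_t len,
                                  size_t *off, uint32_t *cp)
{
    size_t i = *off;
    if (i >= len)
        return UTF8_END;

    unsigned char b = s[i];
    uint32_t c;   /* code point being assembled */
    uint32_t min; /* smallest code point allowed for this length */
    size_t n;     /* number of continuation bytes expected */

    if (b < 0x80)      { c = b;        n = 0; min = 0; }
    else if (b < 0xC0) { return UTF8_NOT_BOUNDARY; }
    else if (b < 0xE0) { c = b & 0x1F; n = 1; min = 0x80; }
    else if (b < 0xF0) { c = b & 0x0F; n = 2; min = 0x800; }
    else if (b < 0xF8) { c = b & 0x07; n = 3; min = 0x10000; }
    else               { return UTF8_INVALID; }

    if (len - i - 1 < n)
        return UTF8_INVALID; /* sequence is cut off by the end of the string */

    for (size_t k = 1; k <= n; k++) {
        unsigned char t = s[i + k];
        if ((t & 0xC0) != 0x80)
            return UTF8_INVALID; /* expected a continuation byte */
        c = (c << 6) | (t & 0x3F);
    }

    /* Reject overlong forms, surrogates, and values beyond U+10FFFF. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return UTF8_INVALID;

    if (cp)
        *cp = c;
    *off = i + n + 1;
    return UTF8_OK;
}

/* Validation is just iteration that must end with UTF8_END. */
static int utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t off = 0;
    enum utf8_status st;
    while ((st = utf8_next(s, len, &off, NULL)) == UTF8_OK) {
        /* nothing to do; only the final status matters */
    }
    return st == UTF8_END;
}
```

Iterating over code points is the same loop as in utf8_is_valid, just passing a non-NULL cp. A backward-scanning counterpart would step back over continuation bytes to the previous lead byte and then decode from there.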
Björn Höhrmann's Flexible and Economical UTF-8 Decoder [1] will be handy for a concise implementation.
[1] https://bjoern.hoehrmann.de/utf-8/decoder/dfa/