|
|
|
|
|
by Thiez
2803 days ago
|
|
While it sure is possible to do text manipulation in C, I don't think it should ever be the first choice, even if 'fastest' is a goal. A 0 byte is perfectly acceptable in a utf8 string (or any unicode string, really). But C has those annoying zero-terminated strings, so if you want to manipulate arbitrary unicode strings the first thing you can do is kiss the string functions in the C standard library goodbye. Which you probably want to do anyway because pascal-strings are simply better. I would use Rust or C++ for this task. |
|
What? My understanding was that utf8 was crafted specifically so that the only null byte in it was literally NUL. That all normal human language described by a utf8 string will never contain a NUL. They're comparable to C strings in that way, where it can be used safely as an end of string marker. If you have embedded NULs, it's not really utf8, is it?