by jesuscyborg
2039 days ago
Use wcspbrk. UTF-8 continuation bytes are limited to the range \200 through \277 (and the lead bytes of multibyte sequences to \302 through \364), so there's basically zero chance that if you choose something like comma as your delimiter it's going to split the middle of a multibyte sequence. Also take into consideration that, under the hood, functions like strpbrk() are typically accelerated by CPU instructions such as PCMPISTRI, which doesn't support UTF-8 natively but does support UCS-2.
Not just "basically": no byte of a valid multibyte sequence can collide with an ASCII character, because every such byte has its high bit set. This can be seen somewhat visually in this table[1] and is an intentional aspect of the UTF-8 design.
[1]: https://en.wikipedia.org/wiki/UTF-8#Encoding
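
To make that concrete, here is a minimal sketch in C of splitting a UTF-8 string on an ASCII delimiter with the byte-oriented strpbrk(). The helper names all_ascii and split_fields are hypothetical, made up for this illustration; the only assumption is that the input is valid UTF-8 and that every delimiter is ASCII (which the code asserts).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns nonzero if every byte of s is ASCII (high bit clear). */
static int all_ascii(const char *s)
{
    for (; *s != '\0'; s++) {
        if ((unsigned char)*s >= 0x80)
            return 0;
    }
    return 1;
}

/*
 * Hypothetical helper: splits `text` on any byte found in `delims`.
 * Records the start pointer and byte length of up to `max` fields and
 * returns the total number of fields found (which may exceed `max`).
 * Because every byte of a UTF-8 multibyte sequence is >= 0x80, an ASCII
 * delimiter can never match inside one, so plain strpbrk() is enough.
 */
static size_t split_fields(const char *text, const char *delims,
                           const char **starts, size_t *lens, size_t max)
{
    size_t n = 0;
    const char *p = text;

    assert(all_ascii(delims)); /* the guarantee only holds for ASCII delimiters */

    for (;;) {
        const char *hit = strpbrk(p, delims);
        size_t len = hit ? (size_t)(hit - p) : strlen(p);

        if (n < max) {
            starts[n] = p;
            lens[n] = len;
        }
        n++;

        if (hit == NULL)
            break;
        p = hit + 1; /* skip the one-byte delimiter */
    }
    return n;
}
```

Calling split_fields("na\xc3\xafve,\xe6\x97\xa5\xe6\x9c\xac,ok", ",", starts, lens, 4) yields three fields with byte lengths 6, 6 and 2; the multibyte characters stay intact because none of their bytes equals ','. Lengths are in bytes, not code points.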