| HN Mirror



	by Tronic2 2264 days ago
	Ignore all character support in the standard library and handle UTF-8 as opaque binary buffers. If you need complex string algorithms, decode into UCS-4 (UTF-32). You'll find short encoding and decoding functions on StackOverflow. For case-insensitive comparisons and sorting, use an external library that knows the latest Unicode standard.
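	To illustrate what one of those short decoding functions might look like, here is a minimal sketch in Python. It is not taken from StackOverflow or any particular library. The name `decode_utf8` and the choice to raise `ValueError` on bad input are assumptions made for this example. It turns a byte buffer into a list of code points (UTF-32 values) and rejects overlong forms, surrogates, and values above U+10FFFF.

```python
def decode_utf8(buf: bytes) -> list:
    """Decode UTF-8 bytes into a list of code points (UTF-32 values).

    Hypothetical helper for illustration. Raises ValueError on malformed
    input, so a successful decode also means the buffer was valid UTF-8.
    """
    out = []
    i = 0
    n = len(buf)
    while i < n:
        b0 = buf[i]
        # Pick payload bits, number of continuation bytes, and the smallest
        # code point that legitimately needs this sequence length.
        if b0 < 0x80:
            cp, extra, lowest = b0, 0, 0
        elif 0xC2 <= b0 <= 0xDF:
            cp, extra, lowest = b0 & 0x1F, 1, 0x80
        elif 0xE0 <= b0 <= 0xEF:
            cp, extra, lowest = b0 & 0x0F, 2, 0x800
        elif 0xF0 <= b0 <= 0xF4:
            cp, extra, lowest = b0 & 0x07, 3, 0x10000
        else:
            raise ValueError(f"invalid lead byte 0x{b0:02X} at offset {i}")

        if i + extra >= n:
            raise ValueError(f"truncated sequence at offset {i}")

        for j in range(1, extra + 1):
            c = buf[i + j]
            if (c & 0xC0) != 0x80:
                raise ValueError(f"invalid continuation byte at offset {i + j}")
            cp = (cp << 6) | (c & 0x3F)

        # Reject overlong encodings, UTF-16 surrogates, and out-of-range values.
        if cp < lowest or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"invalid code point at offset {i}")

        out.append(cp)
        i += extra + 1
    return out
```

	Because every malformed sequence raises, wrapping the call in `try`/`except ValueError` also works as a validity check.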

1 comments

barbegal 2264 days ago

Except that not all binary data is valid UTF-8, so you also need functions that check whether a binary buffer is valid UTF-8.


Tronic2 2264 days ago

The decoding phase will do that, if needed. Also note that in many cases you must process the data as opaque binary even though it should be valid UTF-8. This applies in particular to filenames on POSIX systems, because otherwise you could not access any files that happen to have invalid UTF-8 in their names.
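As a rough sketch of the filename case, assuming Python on a POSIX filesystem that accepts arbitrary bytes in names (most Linux filesystems do; some others reject them): passing a `bytes` path to `os.listdir` and `open` keeps names as raw bytes, so nothing is ever decoded. The helper names `list_raw_names` and `read_by_raw_name` are made up for this example.

```python
import os

def list_raw_names(directory: bytes) -> list:
    """Return directory entries as raw bytes, without decoding them."""
    # os.listdir returns bytes names when it is given a bytes path.
    return sorted(os.listdir(directory))

def read_by_raw_name(directory: bytes, name: bytes) -> bytes:
    """Open a file by its raw byte name, valid UTF-8 or not."""
    with open(os.path.join(directory, name), "rb") as f:
        return f.read()
```

A name such as `b"caf\xe9-\xff.txt"` is not valid UTF-8, but it still round-trips through both helpers because it is only ever treated as bytes.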
