by johncolanduoni 3297 days ago
And those string algorithms likely break in subtle ways when they handle characters that span multiple codepoints.
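To illustrate the kind of subtle breakage meant here, a minimal Python sketch (my own example, not from the original comment). Python strings index by code point, so reversing or slicing a string that contains a combining mark separates the mark from its base letter. The sample string is an assumption chosen to show the effect.

```python
# "café" written with a base letter 'e' followed by U+0301 COMBINING ACUTE ACCENT
s = "cafe\u0301"

# Reversing by code point moves the accent to the front of the string,
# away from the 'e' it belonged to.
naive_reversed = s[::-1]

# Slicing by code point silently drops the accent.
truncated = s[:4]

print(len(s))                 # 5 code points, although a reader sees 4 characters
print(repr(naive_reversed))   # '́efac' (accent first)
print(repr(truncated))        # 'cafe'
```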
2 comments

> break in subtle ways when they handle characters that span multiple codepoints

Or equivalently: there is more than one way to turn a string into a list. It can be, e.g., a sequence of bytes, Unicode code points, or grapheme clusters. Being explicit about the conversion is therefore a good idea.
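A rough Python sketch of the three views of the same string. The helper `approx_graphemes` is hypothetical and only attaches combining marks to the preceding code point; it is not a full implementation of Unicode text segmentation (it ignores emoji ZWJ sequences, Hangul jamo, and so on), so treat it as an illustration only.

```python
import unicodedata


def approx_graphemes(text):
    """Group each combining mark with the code point before it.

    Hypothetical helper: a crude approximation of grapheme clusters,
    not the full segmentation rules from the Unicode standard.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch   # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


s = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

as_bytes = list(s.encode("utf-8"))   # explicit: UTF-8 bytes
as_code_points = list(s)             # explicit: code points
as_clusters = approx_graphemes(s)    # explicit: (approximate) grapheme clusters

print(len(as_bytes), len(as_code_points), len(as_clusters))  # 3 2 1
```

Each conversion is spelled out at the call site, so a reader can see which "list of characters" a given algorithm is actually working on.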

Don't forget splitting on word boundaries and/or whitespace - going from a string of text to an iterable collection of words (strings).
Or, for the case of (e.g.) domain names, splitting on dots. More generally: given a set of separator characters, breaking the string into a collection of substrings.
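A small Python sketch of those splits. `split_on_any` is a hypothetical helper name, and it compares one code point at a time, so it assumes every separator is a single code point.

```python
def split_on_any(text, split_chars):
    """Break text into substrings at any character found in split_chars.

    Hypothetical helper; empty pieces between adjacent separators are dropped.
    """
    pieces = []
    current = []
    for ch in text:
        if ch in split_chars:
            if current:
                pieces.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        pieces.append("".join(current))
    return pieces


words = "the quick  brown fox".split()      # whitespace splitting (built in)
labels = "www.example.com".split(".")       # domain name labels
parts = split_on_any("a.b-c..d", ".-")      # general case: a set of separators

print(words)   # ['the', 'quick', 'brown', 'fox']
print(labels)  # ['www', 'example', 'com']
print(parts)   # ['a', 'b', 'c', 'd']
```

Real word-boundary detection is harder than whitespace splitting (some scripts do not separate words with spaces), so this only covers the simple separator case.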
Not if the "iterating over characters" function iterates over actual characters and not codepoints.
You mean grapheme clusters? Swift is the only language I know of that does that by default, and you still wouldn't want to store strings as a list of grapheme clusters.
I believe Perl 6 does so as well, see e.g. https://perl6advent.wordpress.com/2015/12/07/day-7-unicode-p...
The Apple developer documentation has a nice overview of some of the concerns that need to be taken into account for this to work:

https://developer.apple.com/library/content/documentation/Co...

It's probably one of the better approaches, but it's still not clear whether it (alone) allows a developer who speaks only English to build a text indexing or editing system that works well across, for example, English, Japanese, Arabic, Hangul and Dutch.

Elixir as well.
The problem is that “actual character” is an ill-defined term; it could mean either a code point or a grapheme. See, e.g., http://unicode.org/faq/char_combmark.html
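A short Python sketch of why the term is slippery, using the standard `unicodedata` module. The same visible "é" can be one code point or two depending on normalization form, so counting code points does not give a stable answer to "how many characters". The sample strings are my own.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
combining = "e\u0301"    # 'e' + COMBINING ACUTE ACCENT, two code points

# Both render as the same grapheme, but they differ as code point sequences.
print(precomposed == combining)            # False
print(len(precomposed), len(combining))    # 1 2

# Normalizing to a common form makes them comparable.
nfc = unicodedata.normalize("NFC", combining)    # composes to one code point
nfd = unicodedata.normalize("NFD", precomposed)  # decomposes to two code points

print(nfc == precomposed)  # True
print(nfd == combining)    # True
```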