

	by johncolanduoni 3297 days ago
	And those string algorithms likely break in subtle ways when they handle characters that span multiple codepoints.
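A minimal Python sketch of the kind of breakage meant here, assuming a string where one visible character is stored as two code points. The sample string is made up for illustration and is not from the thread.

```python
# "café" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT,
# so the last visible character spans two code points.
s = "cafe\u0301"

# Python's len() counts code points, not visible characters.
code_point_count = len(s)

# Reversing by code point moves the accent to the front, detached from 'e'.
reversed_naive = s[::-1]

# Truncating by code point count silently drops the accent.
truncated = s[:4]
```

Neither operation raises an error, which is what makes the failure subtle: the result is still a valid string, just not the one intended.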

2 comments

SideburnsOfDoom 3297 days ago

> break in subtle ways when they handle characters that span multiple codepoints

Or equivalently: there is more than one way to turn a string into a list. It can be, e.g., a sequence of bytes, Unicode code points, or grapheme clusters. Being explicit about the conversion is therefore a good idea.
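The three views can be sketched in Python as follows. The helper `approx_graphemes` is hypothetical and deliberately simplified: it only attaches combining marks to the preceding base character and does not implement the full Unicode text segmentation rules (UAX #29), so emoji sequences, Hangul jamo and similar cases are out of scope.

```python
import unicodedata


def approx_graphemes(text):
    """Group each code point with the combining marks that follow it.

    Simplifying assumption: a cluster is a base code point plus any
    code points with a non-zero canonical combining class.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


s = "cafe\u0301"  # 'e' + U+0301 COMBINING ACUTE ACCENT

as_bytes = list(s.encode("utf-8"))  # explicit: UTF-8 code units
as_code_points = list(s)            # explicit: Unicode code points
as_clusters = approx_graphemes(s)   # explicit: approximate grapheme clusters
```

The same string yields lists of three different lengths, which is why naming the conversion at the call site helps.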


e12e 3296 days ago

Don't forget splitting on word boundaries and/or whitespace - going from a string of text to an iterable collection of words (strings).
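A rough Python sketch of both splits, assuming English-like text. The regular expression is an illustrative approximation of word boundaries, not the Unicode word segmentation algorithm, and it would not help for languages written without spaces.

```python
import re

text = "Don't forget: splitting on word boundaries, and/or whitespace."

# Whitespace split: punctuation stays attached to each token.
by_whitespace = text.split()

# Approximate word split: runs of word characters, allowing inner apostrophes.
by_words = re.findall(r"\w+(?:'\w+)*", text)
```

The two results differ on tokens such as "forget:" and "and/or", so the choice of boundary rule is itself part of the conversion.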


SideburnsOfDoom 3296 days ago

Or, in the case of (e.g.) domain names, splitting on dots. More generally: given a set of delimiter characters, break the string into a collection of substrings.
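One way this general form might look in Python. `split_on_chars` is a hypothetical helper name, and the sketch assumes each delimiter is a single code point.

```python
import re


def split_on_chars(text, delimiters):
    """Split text at every occurrence of any character in delimiters."""
    if not delimiters:
        raise ValueError("at least one delimiter character is required")
    # re.escape keeps characters such as '.', '^' or ']' literal
    # inside the character class.
    pattern = "[" + re.escape(delimiters) + "]"
    return re.split(pattern, text)


labels = split_on_chars("news.ycombinator.com", ".")
fields = split_on_chars("a,b;c", ",;")
```

For a single delimiter, `str.split(".")` does the same job; the character class only matters once there are several.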


adrianN 3297 days ago

Not if the "iterate over characters" function iterates over actual characters rather than code points.


johncolanduoni 3297 days ago

You mean grapheme clusters? Swift is the only language I know of that iterates over them by default, and you still wouldn't want to store strings as a list of grapheme clusters.


cannam 3297 days ago

I believe Perl 6 does so as well; see, e.g., https://perl6advent.wordpress.com/2015/12/07/day-7-unicode-p...


e12e 3296 days ago

The Apple developer documentation has a nice overview of some of the concerns that need to be taken into account for this to work:

https://developer.apple.com/library/content/documentation/Co...

It's probably one of the better approaches - but it's still not clear whether it (alone) allows a developer who speaks only English to build a text indexing or editing system that works well across, for example, English, Japanese, Arabic, Korean (Hangul) and Dutch.


jswny 3296 days ago

Elixir as well.


dragonwriter 3296 days ago

The problem is that “actual characters” is an ill-defined term; it could mean either code points or graphemes. See, e.g., http://unicode.org/faq/char_combmark.html
