| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by klodolph 2237 days ago

> Any higher-level abstract mention of Levenstein distances (e.g. of Unicode codepoints) is properly supposed to be taken to refer to the Levenstein distance of a conventional (or explicitly specified) binary encoding of the two strings.

This doesn’t match any definition of Levenshtein distance that I’ve ever encountered. I’ve always seen it defined in terms of strings over some alphabet, and the binary case is just what happens when your alphabet only has two symbols in it.

Quite naturally the problem with Unicode strings is that there is are multiple ways to treat them as sequences. One obvious way is to treat them as a sequence of Unicode scalar values, but that’s by no means what you’d want—maybe a sequence of grapheme clusters may be more appropriate, and you also may wish to consider normalization.