Hacker News new | ask | show | jobs
by arcticbull 2231 days ago
This doesn't even begin to get into the question of what Levenshtein Distance even means in a Unicode context. What's the Levenshtein Distance of 3 emoji flags? I suppose we should be segmenting by grapheme clusters and utilizing a consistent normalization form when comparing, but Rust has no native support for processing grapheme clusters -- or for normalizations I believe. The UnicodeSegementation crate might help.

Based on some cursory research, the go version differs in a more subtle way too. A Rune is a Code Point, which is a superset of the Rust "char" type; it includes surrogate pairs.

1 comments

Levenstein (edit) distance is fundamentally an information-theoretical concept defined on bitstreams, as insertions/deletions/swaps of individual bits within a stream. It has a lot in common with error-correcting codes, fountain codes, and compression, which all also operate on bitstreams.

Any higher-level abstract mention of Levenstein distances (e.g. of Unicode codepoints) is properly supposed to be taken to refer to the Levenstein distance of a conventional (or explicitly specified) binary encoding of the two strings.

Can you point to a source that defines Levenstein distance as only referring to bitstreams?

A translation of the original article [1] that introduced the concept notes in a footnote that "the definitions given below are also meaningful if the code is taken to mean an arbitrary set of words (possibly of different lengths) in some alphabet containing r letters (r >= 2)".

And if you wish to strictly stick to how it was originally defined, you'd need to only use strings of the same length.

More recent sources [2] say instead "over some alphabet", and even in the first footnote, describe results for "arbitrarily large alphabets"!

[1] https://nymity.ch/sybilhunting/pdf/Levenshtein1966a.pdf

[2] https://arxiv.org/pdf/1005.4033.pdf

And Unicode is the biggest alphabet haha.
The problem is there's not a "conventional" binary encoding of Unicode strings. You can create the same exact character many different ways, from a single scalar value to a composition of multiple scalar values. There's also multiple different ways of ordering different pieces of a composite character that yield the same value. Would we not want to utilize a consistent decomposition and consistent logical segmentation for the string? It's no longer enough to iterate over a string one byte at a time and derive any meaning. Is it right that the Levenstein distance between 2 equal characters, "a" and "a", might be 12, simply because the joiners were ordered differently?

It seems like segmentation by grapheme cluster and comparison using a consistent normalization would provide the same logical answer as a classic byte-wise Levenshtein distance on an ASCII string. [1]

Or are you suggesting that's too high level and we should just consider this to be operating on bit strings that happen to be grouped into bytes, and not worry about the logical implications. Therefore we'd just use a consistent normalization form on both input strings, and it's okay that the distance is up to like 10-15 for a single character difference in a composite character and 1 in an ASCII character. That sounds totally reasonable too, just different.

[1] http://unicode.org/reports/tr15/

> Any higher-level abstract mention of Levenstein distances (e.g. of Unicode codepoints) is properly supposed to be taken to refer to the Levenstein distance of a conventional (or explicitly specified) binary encoding of the two strings.

This doesn’t match any definition of Levenshtein distance that I’ve ever encountered. I’ve always seen it defined in terms of strings over some alphabet, and the binary case is just what happens when your alphabet only has two symbols in it.

Quite naturally the problem with Unicode strings is that there is are multiple ways to treat them as sequences. One obvious way is to treat them as a sequence of Unicode scalar values, but that’s by no means what you’d want—maybe a sequence of grapheme clusters may be more appropriate, and you also may wish to consider normalization.