by arcticbull
2231 days ago
This doesn't even begin to get into the question of what Levenshtein distance means in a Unicode context. What's the Levenshtein distance between three emoji flags? I suppose we should be segmenting by grapheme clusters and using a consistent normalization form when comparing, but Rust's standard library has no support for processing grapheme clusters, or for normalization, I believe. The unicode-segmentation crate might help. Based on some cursory research, the Go version differs in a more subtle way too. A rune is a code point, which is a superset of the Rust "char" type: a char is a Unicode scalar value and so excludes the surrogate code points, while a rune can hold them.
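To make the unit question concrete, here is a minimal Rust sketch (std only) that runs the same textbook dynamic-programming Levenshtein over three different units: UTF-8 bytes, UTF-16 code units, and chars (Unicode scalar values). The function names are my own, not from any crate, and grapheme clusters and normalization are deliberately left out because std doesn't provide them.

```rust
/// Levenshtein distance over slices of any comparable unit.
/// Assumption: unit-cost insert, delete, and substitute.
fn levenshtein<T: PartialEq>(a: &[T], b: &[T]) -> usize {
    // prev[j] is the distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, x) in a.iter().enumerate() {
        let mut curr = Vec::with_capacity(b.len() + 1);
        // Turning a[..=i] into the empty prefix of `b` costs i + 1 deletions.
        curr.push(i + 1);
        for (j, y) in b.iter().enumerate() {
            let substitute = prev[j] + usize::from(x != y);
            let delete = prev[j + 1] + 1;
            let insert = curr[j] + 1;
            curr.push(substitute.min(delete).min(insert));
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Compare as UTF-8 bytes.
fn distance_by_bytes(a: &str, b: &str) -> usize {
    levenshtein(a.as_bytes(), b.as_bytes())
}

/// Compare as UTF-16 code units (surrogate pairs count as two units).
fn distance_by_utf16(a: &str, b: &str) -> usize {
    let a: Vec<u16> = a.encode_utf16().collect();
    let b: Vec<u16> = b.encode_utf16().collect();
    levenshtein(&a, &b)
}

/// Compare as Rust chars, i.e. Unicode scalar values.
fn distance_by_chars(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    levenshtein(&a, &b)
}
```

With these definitions, distance_by_bytes("a", "😀") is 4, distance_by_utf16 gives 2, and distance_by_chars gives 1. For "é" (U+00E9) against "e" followed by U+0301, which render identically, the results are 3, 2, and 2; only normalizing both sides first would bring that to 0.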
|
Any higher-level, abstract mention of Levenshtein distance (e.g. over Unicode code points) should properly be taken to refer to the Levenshtein distance over a conventional (or explicitly specified) binary encoding of the two strings.