by arcticbull 2239 days ago

Thanks for following up. Just as an FYI, there are a few bugs in your implementation. The most obvious one is the use of ".len()" in a number of places, interspersed with ".chars().count()". These two return different values. ".len()" returns the number of UTF-8 bytes in the input string, which for ASCII is the same as ".chars().count()", but as soon as the input contains any non-ASCII characters, your function won't work. ".chars()" yields Unicode Scalar Values (USVs), which are a subset of code points that excludes the surrogate code points [1]. Note also that this is not the same as a Go rune, which is a code point and so can include surrogates.
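To make the difference concrete, here is a minimal sketch using only the standard library. The helper name `byte_and_scalar_counts` is made up for illustration and is not from the code under discussion.

```rust
/// Returns (UTF-8 byte length, Unicode scalar value count) for a string.
/// Hypothetical helper, only here to put the two numbers side by side.
fn byte_and_scalar_counts(s: &str) -> (usize, usize) {
    // `len()` is O(1) and counts bytes; `chars().count()` walks the whole
    // string and counts Unicode scalar values.
    (s.len(), s.chars().count())
}
```

For pure ASCII such as "kitten" both numbers agree. For "na\u{ef}ve" (with a precomposed "ï") the byte count is one larger than the scalar count, so mixing the two in index arithmetic goes wrong.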

Secondly, you re-implemented "std::cmp::min" at the bottom of the file, and I'm not sure if the stdlib version is more optimized.

Lastly, well, you caught the issue with repeated passes over the string.
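For anyone following along, this is roughly what addressing those three points can look like. It is a sketch under my own assumptions, not necessarily what the gist below does: the name `levenshtein_chars` is made up, and it measures distance in Rust `char`s (scalar values), decoding each string exactly once and using `std::cmp::min`.

```rust
use std::cmp::min;

/// Levenshtein distance counted in Unicode scalar values (Rust `char`s).
/// Illustrative sketch; the function name is an assumption.
fn levenshtein_chars(a: &str, b: &str) -> usize {
    // Decode each string exactly once; indexing the Vec is O(1) afterwards,
    // so there are no repeated passes over the UTF-8 bytes.
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();

    // `prev[j]` is the distance between the first i chars of `a`
    // and the first j chars of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr = vec![0; b.len() + 1];

    for (i, &ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, &cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr[j + 1] = min(substitution, min(deletion, insertion));
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()]
}
```

Only two rows of the table are kept, so memory is proportional to the length of `b` in scalar values.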

I've fixed the issues if you're curious: https://gist.github.com/martinmroz/2ff91041416eeff1b81f624ea...

Unrelated, I hate the term "fake news" as it's an intentional attempt to destroy the world public's faith in news media. It's a cancer on civilized society. Somewhere your civics teacher is crying into some whiskey, even though of course you're joking.

[1] http://www.unicode.org/glossary/#unicode_scalar_value

3 comments

arcticbull 2239 days ago

This doesn't even begin to get into the question of what Levenshtein Distance even means in a Unicode context. What's the Levenshtein Distance of 3 emoji flags? I suppose we should be segmenting by grapheme clusters and utilizing a consistent normalization form when comparing, but Rust's standard library has no support for processing grapheme clusters -- or for normalization, I believe. The unicode-segmentation crate might help.
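A small standard-library-only sketch of why flags are awkward: each flag is a pair of regional-indicator scalar values, so three flags are six `char`s. The constant and helper names are made up for illustration. Grapheme segmentation itself is not shown, since it needs an external crate.

```rust
/// Three flags (US, CA, FR), written as escapes so the scalar values are visible.
/// Each flag is two regional-indicator scalar values.
const THREE_FLAGS: &str = "\u{1F1FA}\u{1F1F8}\u{1F1E8}\u{1F1E6}\u{1F1EB}\u{1F1F7}";

/// Lists the scalar values of `s` as raw code point numbers.
/// Hypothetical helper for inspection only.
fn scalar_values(s: &str) -> Vec<u32> {
    s.chars().map(|c| c as u32).collect()
}
```

A reader sees three symbols, `chars()` sees six scalar values, and `len()` reports 24 bytes, so the "distance" between two flags depends entirely on which of those units the algorithm is run over.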

Based on some cursory research, the Go version differs in a more subtle way too. A rune is a code point, which is a superset of the Rust "char" type; it includes surrogates.

derefr 2239 days ago

Levenshtein (edit) distance is fundamentally an information-theoretical concept defined on bitstreams, as insertions/deletions/swaps of individual bits within a stream. It has a lot in common with error-correcting codes, fountain codes, and compression, which all also operate on bitstreams.

Any higher-level abstract mention of Levenshtein distances (e.g. of Unicode codepoints) is properly supposed to be taken to refer to the Levenshtein distance of a conventional (or explicitly specified) binary encoding of the two strings.

grantwu 2239 days ago

Can you point to a source that defines Levenshtein distance as only referring to bitstreams?

A translation of the original article [1] that introduced the concept notes in a footnote that "the definitions given below are also meaningful if the code is taken to mean an arbitrary set of words (possibly of different lengths) in some alphabet containing r letters (r >= 2)".

And if you wish to strictly stick to how it was originally defined, you'd need to only use strings of the same length.

More recent sources [2] say instead "over some alphabet", and even in the first footnote, describe results for "arbitrarily large alphabets"!

[1] https://nymity.ch/sybilhunting/pdf/Levenshtein1966a.pdf

[2] https://arxiv.org/pdf/1005.4033.pdf

arcticbull 2239 days ago

And Unicode is the biggest alphabet haha.

arcticbull 2239 days ago

The problem is there's not a "conventional" binary encoding of Unicode strings. You can create the same exact character many different ways, from a single scalar value to a composition of multiple scalar values. There are also multiple different ways of ordering different pieces of a composite character that yield the same value. Would we not want to utilize a consistent decomposition and consistent logical segmentation for the string? It's no longer enough to iterate over a string one byte at a time and derive any meaning. Is it right that the Levenshtein distance between 2 equal characters, "a" and "a", might be 12, simply because the joiners were ordered differently?
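A minimal illustration of both cases using only the standard library. The constants and the helper `same_scalars` are made up for this example. Actual normalization is not shown, because it needs an external crate.

```rust
/// "é" as one precomposed scalar value (U+00E9).
const COMPOSED: &str = "\u{e9}";
/// "é" as "e" followed by a combining acute accent (U+0301).
const DECOMPOSED: &str = "e\u{301}";

/// "a" with an acute above (U+0301) and a dot below (U+0323), in both orders.
/// The two orders are canonically equivalent but differ as scalar sequences.
const MARKS_ACUTE_FIRST: &str = "a\u{301}\u{323}";
const MARKS_DOT_FIRST: &str = "a\u{323}\u{301}";

/// True when two strings are the same sequence of scalar values.
/// No normalization is applied; hypothetical helper for illustration.
fn same_scalars(a: &str, b: &str) -> bool {
    a.chars().eq(b.chars())
}
```

Both pairs render identically yet compare as different, so a scalar-level or byte-level edit distance between them is nonzero unless both inputs are first brought to the same normalization form.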

It seems like segmentation by grapheme cluster and comparison using a consistent normalization would provide the same logical answer as a classic byte-wise Levenshtein distance on an ASCII string. [1]

Or are you suggesting that's too high level, and that we should just consider this to be operating on bit strings that happen to be grouped into bytes, without worrying about the logical implications? In that case we'd just use a consistent normalization form on both input strings, and accept that the distance is up to something like 10-15 for a single-character difference in a composite character and 1 in an ASCII character. That sounds totally reasonable too, just different.

[1] http://unicode.org/reports/tr15/

klodolph 2239 days ago

> Any higher-level abstract mention of Levenshtein distances (e.g. of Unicode codepoints) is properly supposed to be taken to refer to the Levenshtein distance of a conventional (or explicitly specified) binary encoding of the two strings.

This doesn’t match any definition of Levenshtein distance that I’ve ever encountered. I’ve always seen it defined in terms of strings over some alphabet, and the binary case is just what happens when your alphabet only has two symbols in it.
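As a sketch of the "strings over some alphabet" view, the function below is generic over the symbol type, so bits, bytes, and `char`s are all just different choices of alphabet. The name `levenshtein` and its signature are my own, not from any code in this thread.

```rust
/// Levenshtein distance between two sequences over an arbitrary alphabet `T`.
/// Illustrative sketch; name and signature are assumptions.
fn levenshtein<T: PartialEq>(a: &[T], b: &[T]) -> usize {
    // Single-row dynamic programming: at the start of each outer iteration,
    // `row` still holds the previous row of the distance table.
    let mut row: Vec<usize> = (0..=b.len()).collect();
    for (i, x) in a.iter().enumerate() {
        let mut diagonal = row[0]; // previous row, column j
        row[0] = i + 1;
        for (j, y) in b.iter().enumerate() {
            let above = row[j + 1]; // previous row, column j + 1
            let cost = if x == y { 0 } else { 1 };
            row[j + 1] = (diagonal + cost).min(above + 1).min(row[j] + 1);
            diagonal = above;
        }
    }
    row[b.len()]
}
```

Called on `&[bool]` this is the binary case. Called on the UTF-8 bytes of "caf\u{e9}" and "cafe" it gives a larger distance than on their `char`s, because "é" is two bytes but one scalar value.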

Quite naturally, the problem with Unicode strings is that there are multiple ways to treat them as sequences. One obvious way is to treat them as a sequence of Unicode scalar values, but that’s by no means always what you’d want; a sequence of grapheme clusters may be more appropriate, and you may also wish to consider normalization.

afiori 2239 days ago

(related to the unrelated part) What if the media is corrupt? I mean, independently of current events (I really don't want to get into that here), we do live in a world where a very few amoral corporations own most of the media industry.

If we (correctly) rely on the media to bring relevant facts (both criminal and non-criminal) to public attention and to keep a watchful eye on the nation, who then keeps a watchful eye on the media?

Is the model entirely based on there always being enough good journalists to spot the bad ones? How is this affected by the very precarious economics of current internet ad-based, venture-funded media enterprises?

I just blurted out too many questions... What I am trying to say is that, as with the police, there is no easy should-trust/should-not-trust answer (in the US a Supreme Court judge advised to "not talk to the police").

In that case I guess part of the problem is that the job of the police can be misconstrued as "arresting people". In the same way, the job of a journalist can be misconstrued as "getting clicks".

Overall I don't think we can pass an a priori moral judgement on that term, as it essentially represents a statement that the default safety measures have failed.

(I want to reiterate that I am trying not to mix my point up with whether or not I believe the current use is warranted. I am just trying to say that, as a concept, it needs to be part of a healthy democracy, the same as some distrust of electoral promises.)

dcow 2239 days ago

"Fake News" entered the realm (not even recently, might I add) of popular misuse. Actually, is there a term for: language/grammar incorrectly used because society has developed a familiar "meme" use?

Common examples:

* Look at this dank "meme".

meme has come to mean "a picture shared on the internet that has words on it".

* Let's [have a] "cheers".

It's a toast. You say "cheers" when you toast.

* You missed Suzie and I's party last night.

It's Suzie and my party. This one is particularly annoying because it's made its way past editors and into writing, screenplays, etc.

Robin_Message 2239 days ago

If it's the kind of things people say, you want it in screenplays, otherwise Brooklyn 99 would sound like Shakespeare.
