Hacker News new | ask | show | jobs
by chrfrasco 2231 days ago
Hey all, as some keen-eyed commenters have pointed out, it looks like the rust program is not actually equivalent to the go program. The go program parses the string once, while the rust program parses it repeatedly inside every loop. It's quite late in Sydney as I write this so I'm not up for a fix right now, but this post is probably Fake News. The perf gains from jemalloc are real, but it's probably not the allocators fault. I've updated the post with this message as well.

The one-two combo of 1) better performance on linux & 2) jemalloc seeming to fix the issue lured me into believing that the allocator was to blame. I’m not sure what the lesson here is – perhaps more proof of Cunningham’s law? https://en.wikipedia.org/wiki/Ward_Cunningham#Cunningham's_L...

3 comments

Thanks for following up. Just as an FYI, there's a few bugs in your implementation, the most obvious one is the use of ".len()" in a number of places interspersed with ".chars().count()". These two return different values. ".len()" returns then number of UTF-8 bytes in the input string, which for ASCII is the same as ".chars().count()" obviously, but if you do attempt any Unicode characters, your function won't work. ".chars()" provides Unicode Scalar Values (USVs) -- which is a subset of code points, excluding surrogate pairs [1]. Note also this is not the same as a Go rune, which is a code point including surrogate pairs.

Secondly, you re-implemented "std::cmp::min" at the bottom of the file, and I'm not sure if the stdlib version is more optimized.

Lastly, well, you caught the issue with repeated passes over the string.

I've fixed the issues if you're curious: https://gist.github.com/martinmroz/2ff91041416eeff1b81f624ea...

Unrelated, I hate the term "fake news" as it's an intentional attempt to destroy the world public's faith in news media. It's a cancer on civilized society. Somewhere your civics teacher is crying into some whiskey, even though of course you're joking.

[1] http://www.unicode.org/glossary/#unicode_scalar_value

This doesn't even begin to get into the question of what Levenshtein Distance even means in a Unicode context. What's the Levenshtein Distance of 3 emoji flags? I suppose we should be segmenting by grapheme clusters and utilizing a consistent normalization form when comparing, but Rust has no native support for processing grapheme clusters -- or for normalizations I believe. The UnicodeSegementation crate might help.

Based on some cursory research, the go version differs in a more subtle way too. A Rune is a Code Point, which is a superset of the Rust "char" type; it includes surrogate pairs.

Levenstein (edit) distance is fundamentally an information-theoretical concept defined on bitstreams, as insertions/deletions/swaps of individual bits within a stream. It has a lot in common with error-correcting codes, fountain codes, and compression, which all also operate on bitstreams.

Any higher-level abstract mention of Levenstein distances (e.g. of Unicode codepoints) is properly supposed to be taken to refer to the Levenstein distance of a conventional (or explicitly specified) binary encoding of the two strings.

Can you point to a source that defines Levenstein distance as only referring to bitstreams?

A translation of the original article [1] that introduced the concept notes in a footnote that "the definitions given below are also meaningful if the code is taken to mean an arbitrary set of words (possibly of different lengths) in some alphabet containing r letters (r >= 2)".

And if you wish to strictly stick to how it was originally defined, you'd need to only use strings of the same length.

More recent sources [2] say instead "over some alphabet", and even in the first footnote, describe results for "arbitrarily large alphabets"!

[1] https://nymity.ch/sybilhunting/pdf/Levenshtein1966a.pdf

[2] https://arxiv.org/pdf/1005.4033.pdf

And Unicode is the biggest alphabet haha.
The problem is there's not a "conventional" binary encoding of Unicode strings. You can create the same exact character many different ways, from a single scalar value to a composition of multiple scalar values. There's also multiple different ways of ordering different pieces of a composite character that yield the same value. Would we not want to utilize a consistent decomposition and consistent logical segmentation for the string? It's no longer enough to iterate over a string one byte at a time and derive any meaning. Is it right that the Levenstein distance between 2 equal characters, "a" and "a", might be 12, simply because the joiners were ordered differently?

It seems like segmentation by grapheme cluster and comparison using a consistent normalization would provide the same logical answer as a classic byte-wise Levenshtein distance on an ASCII string. [1]

Or are you suggesting that's too high level and we should just consider this to be operating on bit strings that happen to be grouped into bytes, and not worry about the logical implications. Therefore we'd just use a consistent normalization form on both input strings, and it's okay that the distance is up to like 10-15 for a single character difference in a composite character and 1 in an ASCII character. That sounds totally reasonable too, just different.

[1] http://unicode.org/reports/tr15/

> Any higher-level abstract mention of Levenstein distances (e.g. of Unicode codepoints) is properly supposed to be taken to refer to the Levenstein distance of a conventional (or explicitly specified) binary encoding of the two strings.

This doesn’t match any definition of Levenshtein distance that I’ve ever encountered. I’ve always seen it defined in terms of strings over some alphabet, and the binary case is just what happens when your alphabet only has two symbols in it.

Quite naturally the problem with Unicode strings is that there is are multiple ways to treat them as sequences. One obvious way is to treat them as a sequence of Unicode scalar values, but that’s by no means what you’d want—maybe a sequence of grapheme clusters may be more appropriate, and you also may wish to consider normalization.

(related to the unrelated part) what if the media is corrupt? I mean, independently on current events (I really don't want to enter that here) we do live in a world where very few amoral corporations own most of the media industry.

If we (correctly) rely on the media to bring to public attentions relevant facts (both criminal and non-criminal) and keep a watchful eye on the nation who then keeps a watchful eye on the media?

is the model entirely based on always being there enough good journalist to spot the bad ones? how is this affected by the very precarious economics of current internet ads-based venture-funded media enterprises?

I just blurted too many questions... what I am trying to say is that similarly with the police there is not as easy answer in shoud-trust should-not-trust (in the US a supreme Court judge advised to "not talk to the police").

in that case I guess part of the problem is that the job of the police can be miscontrued as "arresting people". in the same way the job of a journalist can be miscontrued as "getting clicks"

overall I don't think we can pass an a priori moral judgement on that term, as essentially represent a statement that the default safety measures have failed.

(I want to reiterate that here I try not to intermingle my point with whether I believe or not that the current use is warranted, I am just trying to say that as a concept it needs to be part of an healthy democracy, the same as some distrust in electoral promises)

"Fake News" entered the realm (not even recently, might I add) of popular misuse. Actually, is there a term for: language/grammar incorrectly used because society has developed a familiar "meme" use?

Common examples:

* Look at this dank "meme".

meme has come to mean "a picture shared on the internet that has words on it".

* Let's [have a] "cheers".

It's a toast. You say "cheers" when you toast.

* You missed Suzie and I's party last night.

It's Suzie and my party. This one is particularly annoying because it's made it way past editors and into writing, screenplay, etc.

If it's the kind of things people say, you want it in screenplays, otherwise Brooklyn 99 would sound like Shakespeare.
> Hey all, as some keen-eyed commenters have pointed out, it looks like the rust program is not actually equivalent to the go program. The go program parses the string once, while the rust program parses it repeatedly inside every loop. […] The perf gains from jemalloc are real, but it's probably not the allocators fault. I've updated the post with this message as well.

I don't know that it would be a gain: Rust is pretty good at decoding UTF8 quickly given how absolutely fundamental that operation is, and "caching" the decoded data would increase pressure on the allocator.

Unless you also changed the interface of the levenshtein to hand it preallocated cache, source and destination buffers (or the caller did that).

edit: to the downvoter, burntsushi did the legwork of actually looking at this[0] and found caching the decoding to have no effect at best, unless the buffers get lifted out of the function entirely, which matches my comment's expectations.

[0] https://news.ycombinator.com/item?id=23059753

> But yes, I did benchmark this, even after reusing allocations, and I can't tell a difference. The benchmark is fairly noisy.

> this post is probably Fake News

It's not Fake News. Fake News is the publication of intentionally false stories. This is just erroneous.

There's a yawning chasm between the two.

"fake news" is an attempt at permanently damaging the American public's faith in the fifth estate. Nobody should ever use that term in the current political climate, ever.
Says who?

When news organizations take other news organizations word for it and the story is false, that's fake news. We called it something different back then, but fake news led to the invasion of Iraq. Negligence is sufficient for fake news, malice not required.

Fake News was a term invented around 2015 during the run-up to the 2016 election. It specifically referred to the phenomenon of literally fake news stories, such as rallies and completely false stories about politicians, being published by legitimate-sounding but nonexistent news organizations, using platforms like Facebook to disseminate themselves. Usually these were done from China and Russia.
You're as silly as those people who claim that male and female refer exclusively to sex but not gender. Terms change in meaning and fake news is widely used to refer to false news stories, regardless of reason. You have to accept the way everybody else is using the word.
So, what do you recommend calling what used to be called fake news? And how we ensure that the its replacement isn't co-opted for malicious purposes again?
And then trump redefined it to be about legitimate news sources exclusively. We’ve really been in upside down world for four years.
That's one of the reasons why I bring it up. If we don't stand against this dilution of a powerful term, it will be co-opted - which is exactly what people who want to destroy the trust of the press for their own personal and political ends want to do.
It’s a joke mate
Of course, but it chips away at the Overton window, normalizing it.