

by fruchtose 5189 days ago

> We work immensely had to product original content as well as link to original sources when deserved...this case was absolutely no different.

Zee, let me tell you something. When you lie, do not lie in a way that can be disproven by math. I mean, it was easy, so easy for me to take five minutes to plug in the article text to a difference calculator and find the result. And here's the outcome:

  There is a 58.632778264680105% difference between Unwieldy's article and TheNextWeb's (41.367221735319895.% similar).

  If you remove the intro from TheNextWeb's article, there is a 28.213166144200624% difference between Unwieldy's article and TheNextWeb's (71.78683385579937.% similar).

Do you want to say the words "original content" again? Because I just determined that your article is at least 41% similar to the one you copied. If we remove the cute introduction, the sameness of your article jumps to over 70%. I used the well-known algorithm called Levenshtein distance. It took me a minute to figure out how I would determine how much of TNW's article was plagiarized. And because I do not copy, I will even show you how I got it.

First, here are the articles I compared (text only, line breaks removed): http://notes.unwieldy.net/post/23049725899/plagiarism and http://thenextweb.com/shareables/2012/05/14/how-3-simple-but...

Here's the code I used: http://ideone.com/BdNk2 (Java)

Here's the code I used to calculate the Levenshtein distance: http://en.wikibooks.org/wiki/Algorithm_Implementation/String...

And here's the technique I used to calculate the Levenshtein difference percentage (thanks to Alex Martelli): http://stackoverflow.com/questions/3106994/algorithm-to-calc...
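For readers who can't follow the truncated links, here is a minimal sketch of the general approach: compute the Levenshtein distance, then express it as a percentage. It is not the linked code. The class and method names are hypothetical, and dividing by the length of the longer string is an assumption about the normalization; the linked answer may normalize differently.

```java
// Sketch only: names are hypothetical and not taken from the linked code.
final class TextDifference {

    // Classic dynamic-programming Levenshtein distance, keeping two rows
    // so memory stays proportional to the length of b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                    // substitution or match
            }
            int[] tmp = prev; // reuse the arrays for the next row
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumption: normalize by the longer string's length, so the result is in [0, 100].
    static double percentDifference(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 0.0; // two empty strings are identical
        }
        return 100.0 * levenshtein(a, b) / longest;
    }

    static double percentSimilar(String a, String b) {
        return 100.0 - percentDifference(a, b);
    }

    public static void main(String[] args) {
        System.out.println(percentDifference("kitten", "sitting"));
    }
}
```

Under this normalization the two percentages always sum to 100, which matches the way the figures above are reported. Running time grows with the product of the two text lengths, which is fine for a pair of articles.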

Now, what could you have done? You could actually admit there was no way you could produce "original content" from copying the original article unless you did actual research beyond what Joshua Gross found. You could merely post a link to the article and say, "This is cool. Check this out." And third, you could be nice on Twitter to the author you shamelessly ripped off. My god, when I can show 41% of your article to be the same as another, the least you should do is be 41% classy about it.

2 comments

jacquesm 5189 days ago

And, to boot, he is still in damage-control mode: now that pretending ignorance does not work, we're going for the 'unfortunate timing' angle.

Nice work on the distance calculation. I think you've just figured out a way to create a blogspam detector: if an article is linked from a newer article and the two are more than X% similar (with X somewhere in the neighbourhood of 45%), then the newer article is blogspam.
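A rough sketch of that rule, assuming the similarity percentage has already been computed by whatever text-difference method you prefer. The class name, method name, and parameters are all hypothetical, and the 45% default is just the ballpark figure suggested here, not a tuned value.

```java
import java.time.Instant;

// Sketch only: every name here is hypothetical.
final class BlogspamCheck {

    // Assumption: "somewhere in the neighbourhood of 45%".
    static final double DEFAULT_THRESHOLD_PERCENT = 45.0;

    // Flags the candidate article when it is newer than the original,
    // links to the original, and is more similar than the threshold.
    static boolean isBlogspam(Instant originalPublished,
                              Instant candidatePublished,
                              boolean candidateLinksToOriginal,
                              double percentSimilar,
                              double thresholdPercent) {
        boolean candidateIsNewer = candidatePublished.isAfter(originalPublished);
        return candidateIsNewer
                && candidateLinksToOriginal
                && percentSimilar > thresholdPercent;
    }
}
```

Plugging in the figures quoted in the parent comment: the whole-article similarity of about 41.37% would fall just under a 45% cutoff, while the 71.79% figure with the intro removed would clear it, so the choice of X and of which text to compare matters a lot.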

fruchtose 5189 days ago

Thanks, but the distance calculation is the work of Alex Martelli from Stack Overflow. It's one of the sources I cited: http://stackoverflow.com/questions/3106994/algorithm-to-calc... It's simple enough that I probably could have thought of it on my own if I spent more time on it, but then again it was simple enough to find with a Google search.

scoot 5189 days ago

I could have sworn it was the work of Vladimir Levenshtein: http://en.wikipedia.org/wiki/Vladimir_Levenshtein

fruchtose 5189 days ago

The code I referenced measures the difference between strings (percentages), using Levenshtein distance--which states the number of changes between two strings. If you can find a source that states this idea of difference can be attributed to Levenshtein, then by all means I will acknowledge him. Until then, I will refer to Alex Martelli's code.
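To make the distinction concrete, here is one common way to turn a raw edit count into a percentage. Dividing by the longer string's length is an assumption for illustration; the referenced code may normalize differently.

```latex
\[
\text{difference}(a,b) = 100 \cdot \frac{\operatorname{lev}(a,b)}{\max(|a|,\,|b|)},
\qquad
\text{similarity}(a,b) = 100 - \text{difference}(a,b)
\]
\[
\operatorname{lev}(\texttt{kitten},\texttt{sitting}) = 3,
\quad \max(6,7) = 7
\;\Longrightarrow\;
\text{difference} = 100 \cdot \tfrac{3}{7} \approx 42.86,
\quad
\text{similarity} \approx 57.14
\]
```

The distance itself is the count of single-character insertions, deletions, and substitutions; the percentage is a layer on top of it.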

scoot 5188 days ago

> The code I referenced measures the difference between strings

Exactly, and the algorithm used is, as I said, attributed to Levenshtein. Expressing it as a ratio is hardly novel.

As for the implementation, Alex Martelli credits Stavros Korokithakis[1], although Lev implementations are two-a-penny, and this isn't a particularly good one (sorry Stavros).

[1]http://www.korokithakis.net/posts/finding-the-levenshtein-di...

fruchtose 5188 days ago

> Exactly, and the algorithm used is, as I said, attributed to Levenshtein.

I cited Levenshtein by name in my original comment. I'm guessing you didn't read everything I wrote, because I don't understand why you would think there's an issue otherwise.

> Expressing it as a ratio is hardly novel.

The whole reason I linked to Alex Martelli's post is that it's his work, not mine, novel or otherwise. I just cited the resources I used.

javajosh 5189 days ago

It is nice to put a number on it, and perhaps this will be the straw that breaks the camel's back, but given Zee's initial reaction to what is clearly plagiarism (using just plain-old eyeballing), it's doubtful that putting a number on the sameness will do anything for this camel's back.

This is just another one of those, "A young CEO with too much balls and not enough brains shoots his mouth off, revealing to all that he is not a Good Person." Of course, CEOs not being Good People is nothing new, but the refusal to even try to pretend to be one is something that I think really angers people.
