De-duplicating Hacker News (shishirprasad.com)
67 points by shishir456 3990 days ago
16 comments

De-duplicating might have a couple of downsides:

1) Not all good stories get taken up the first time.

2) Not everyone reads every story on HN.

After a couple of years of reading HN, I'm happy to see quality posts reappear. I often glean new insights each time. If it's interesting to enough people, it'll bubble up again. The beauty of HN's "gravity" is that anything that's universally boring will disappear quickly.

"Anything that gratifies one's intellectual curiosity" needn't mean, "Hasn't been posted ever before."

If, however, a de-duplicator could automatically provide references to all previous HN discussions on the same topic, that'd be very cool.

I may be mistaken but the issue the author is concerned with is de-duplicating breaking stories, not preventing the reposting of items from the long tail. I don't think he is trying to prevent reposting evergreen content.
Deduplicating the current stories was simple to implement as a first hack, which is why I built a prototype for it. But the idea extends easily to stories across time. All we need is to maintain an index of all Hacker News stories, and the same approach could then be used to prevent reposting old content too.
Preventing the reposting of old content is undesirable. A story that was posted "2516 days ago" has, at most, zero value to the community today. If it prevents a posting today that would provoke meaningful dialog, the old story is likely detrimental. HN is a very different community than it was five years ago.
On the other hand, stories which would otherwise overwhelm the front page could automatically be merged into a single aggregate entry in one slot. As the community grows and diversifies, having another level of abstraction above "story" (which on other sites is taken up by "boards" or "tags") could be useful for organizing that content without disrupting the site too much.
One place a tool like this would be useful is on submission. If I post a link and something comes up that says "HEY! This was last posted 2 days ago and got 300 upvotes and had 45 comments, here's the link.", that would discourage reposting when the poster didn't know the article had been posted before.
HN does this, but only when the link is the same...

People can post similar stories from different media outlets, and if those get picked up at different times, then this could be useful.

I thought it already did that.
Oh, maybe. I guess I've never submitted a link before to know that. Even still, if that algorithm can be improved, even slightly, it'd be better for curbing repeat posts.
Perhaps the de-duplicator as described in the OP could operate on posts within the same X-hour period (where X is 24-48 somewhere). I agree with you that automatically referencing previous postings of the same content would be great for older duplicates.
> Compute pair-wise Jaccard similarity of all articles with each other and output the articles whose title similarity is greater than 0.5.

If I understand correctly, it checks the similarity between titles. How effective would it be if it also checked the similarity between the article contents? Sometimes articles don't have similar-looking titles but are talking about the same thing.

Example: [0], [1], [2]

OT: You can use the Python package haxor to get articles from HN [3]. Disclaimer: I wrote it.

[0] - http://www.bloomberg.com/news/articles/2015-07-17/google-app...

[1] - http://www.cnbc.com/2015/07/17/googles-one-day-rally-is-the-...

[2] - http://www.bbc.com/news/business-33572959

[3] - http://github.com/avinassh/haxor
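
For reference, the quoted step could look roughly like the sketch below. This is not the author's code: the tokenizer and the function names (`tokens`, `jaccard`, `similar_titles`) are assumptions made for illustration. Only the pair-wise comparison and the 0.5 threshold come from the quoted article.

```python
import re
from itertools import combinations


def tokens(title):
    """Lowercase a title and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 0.0  # two empty titles carry no evidence of similarity
    return len(a & b) / len(a | b)


def similar_titles(titles, threshold=0.5):
    """Return (i, j, score) for every pair of titles scoring above threshold."""
    sets = [tokens(t) for t in titles]
    pairs = []
    for i, j in combinations(range(len(titles)), 2):
        score = jaccard(sets[i], sets[j])
        if score > threshold:
            pairs.append((i, j, score))
    return pairs
```

The same `jaccard` function could be fed word sets built from article bodies instead of titles to try the content-based check, at a higher cost per pair.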

My understanding of the article is that this is more about reducing duplication on breaking stories rather than preventing periodic reposting of long tail and evergreen content. That is, the proposal appears focused on the "news" end of the spectrum rather than the "feature" end.

It's a reasonable request but one I tend to disagree with. Funneling all attention on one version of a breaking story reduces nuance and diversity of perspective. It also rewards fastest posting over finding the best content while taking the community out of the calculus of story value. Finally, to me the dispersion of karma awards across multiple submissions is a feature not a bug...karma should tend toward rewarding quality rather than timing where an explicit mechanism is in operation.

Finally, there are occasions where a story is deemed to merit the full attention of the community. The deaths of Steve Jobs and Dennis Ritchie are examples.

I think this is a bad idea. Last week I posted a link here and by random chance no one saw it, and it didn't get any points.

A few days later the same story ended up at the top of /r/machinelearning. At least 10 different people tried to post it. They all ended up at my post which was days old and dead. If HN just allowed resubmissions, it probably would have ended up on the front page that day.

How do you know that?
Posts with >5 upvotes tend to go to the front page for at least a few minutes. More than that many people tried to repost it.
What was the post? I'd be curious to look into it, if you don't mind saying. (Edit: might be best to email hn@ycombinator.com.)
Ok, there was another factor. When a site is the source of many stories marked lightweight, it eventually gets penalized as a lightweight site. We have software that does this, plus moderators do it.

This was the case here: that site has been the source of, not spam exactly (which is why it isn't banned), but a lot of unsubstantive and/or derivative articles. The post you submitted is an example of the latter, since it was derived from a Reddit thread.

The penalties I'm talking about don't make it impossible for a story to get traction, but they do set the bar higher. So you were right that it was randomness, but the randomness was also skewed.

If people try to submit an already posted URL their submission is blocked and they are redirected to the original URL which gets an upvote. So if you see a flurry of upvotes on one of your old submissions it's possibly because people have started trying to post the URL.

I think that's how it works.

Looks like the site's been hacked now.
Interesting approach. A slightly simpler one is to just take the MD5 hash of each paragraph. Two paragraphs with the same hash are almost certainly identical, and two articles with 2 or more identical paragraphs are likely dupes.

So as a suggestion try that algorithm with your current infrastructure and let us know how it compares to the Jaccard similarity test.

Some blogs have a standard closing paragraph like "If you have read all of this, you may like to subscribe to my rss", or "We are always hiring at ABC, send your resume." Another problem is short captions that look like a paragraph to the HTML parser, like "Advertisement" or "XYZ Benchmark (higher is better)". One possible solution is to skip paragraphs that have fewer than, say, 150 letters.
I agree that it is quite reasonable to ignore paragraphs of fewer than 3 sentences.
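A minimal sketch of the paragraph-hash idea, under a few assumptions: paragraphs are separated by blank lines in already-extracted text, the length cutoff counts characters rather than letters, and the names `paragraph_hashes` and `likely_dupe` are made up for this example. The 150 cutoff and the "2 or more identical paragraphs" rule are the ones suggested above.

```python
import hashlib


def paragraph_hashes(text, min_chars=150):
    """Return the set of MD5 digests of paragraphs with at least min_chars characters."""
    hashes = set()
    for para in text.split("\n\n"):
        normalized = " ".join(para.split())  # collapse runs of whitespace
        if len(normalized) >= min_chars:  # skip captions and short boilerplate
            hashes.add(hashlib.md5(normalized.encode("utf-8")).hexdigest())
    return hashes


def likely_dupe(text_a, text_b, min_shared=2, min_chars=150):
    """Flag two articles as likely duplicates if they share enough long paragraphs."""
    shared = paragraph_hashes(text_a, min_chars) & paragraph_hashes(text_b, min_chars)
    return len(shared) >= min_shared
```

Note that a long standard footer would still pass the length filter, so a real version would probably also need a per-site list of known boilerplate hashes.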
Nice, but do remember that dang has asked people to resubmit some links. That adds an interesting variable: did the initial submission actually gain traction, or did it sink into obscurity without any comments being added?
Duplicate content is de-duplicated to an extent. Oftentimes, if I forget where the HN story that corresponds to a page I'm reading is, I resubmit it, and HN redirects me to the HN story.

I believe this is time-boxed to a few weeks, which allows for things like "Something interesting from the past (2008)" to get posted again.

If duplicates bother you, this is a sure sign you're spending too much time on HN.
The problems mentioned can get real. Other simple solutions could be:

- Allow merging threads using special comments. Once a comment with a merge request gets 'a lot' of upvotes (or more than the corresponding thread), the threads get merged.

- Add an 'also on HN' block with all threads linked inside the comments.

- Allow creating compositions in the submit function. This could also create potential for meta-HN content, e.g. 'links with great discussions'.

IMHO: HN is nice because it does not have a lot of functionality. It just works. Complicating things isn't a solution. It's human-driven, in contrast to an automatic news aggregator, and it should stay human-driven.

I'd just hope that the duplicate threads are merged together so I can see all comments because comments can't be duplicated.

I appreciate members who are solving the duplicate problem manually by pasting links to the same post. Maybe there could be a "Report as duplicate" option so that an admin can merge them together.

When it comes to changing things on HN, I think something like comment collapsing would be much more helpful than de-dup (as some of the other comments mention), though I know there are extensions that help out.
That is, apparently, eventually, on the way, as is a mobile-friendly layout.

The staff are probably not looking forward to the day they deploy it and half the users lose their minds with impotent rage because the shibboleth of awkward UX is no longer there to drive away the casuals. Heaven help us if they implement thread folding in javascript...

An optional https://m.news.ycombinator.com/ with bigger font, and bigger vote buttons with better separation between them, would allow them to test how many users want a different UI.
We're not changing the look and feel of HN. New markup will look the same, just work better.
One issue I see is not the dupes but that it's fairly easy to self-vote from several "accounts" and "IPs" to get onto the front page for a little while.

The other one is that, because the news is so fast-paced, only a few stories per day are really interesting, and if you don't look every hour you might miss them (and checking every hour is not very productive either way).

It ain't broke.
The URL seems broken, can you check?
Yeah, it seems to have been hacked or something :(. Let me look into it! The parent site works and you can read the article there: http://shishirprasad.com/
Why can you not just use the canonicalized URL to detect dupes? That is infinitely simpler than doing text analysis.
It will work for simple cases like https vs http or other cases of URL normalization, but it won't work for complex cases where different URLs refer to the same content under different titles.
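To make the "simple cases" concrete, here is a rough sketch of URL normalization using only the standard library. The specific rules (ignore the scheme, a leading "www.", the port, the fragment and a trailing slash) and the name `normalize_url` are assumptions for illustration, not a description of what HN does.

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Reduce a URL to a comparison key of (host, path, query).

    The scheme, port and fragment are dropped so that http/https variants
    and in-page anchors compare equal.
    """
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    return (host, path, parts.query)
```

Two submissions would then count as the same link when their keys match, for example `normalize_url(a) == normalize_url(b)`. Two different outlets covering the same story still produce different keys, which is the complex case described above.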
I think it could work with the canonical tag[1], not the url itself.

[1] http://googlewebmastercentral.blogspot.com.ar/2009/02/specif...
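
Reading the canonical tag could be sketched as below, assuming the page HTML has already been downloaded. The class and function names (`CanonicalFinder`, `canonical_url`) are invented for this example, and it only helps when the publisher actually sets `<link rel="canonical">`.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> tag seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        if "canonical" in rel and attr.get("href"):
            self.canonical = attr["href"]


def canonical_url(html):
    """Return the canonical URL declared in the HTML, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical
```

Submissions whose pages declare the same canonical URL could then be treated as the same link, even if the submitted URLs differ.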