by ISL 3990 days ago
De-duplicating might have a couple of downsides:

1) Not all good stories get taken up the first time.

2) Not everyone reads every story on HN.

After a couple of years of reading HN, I'm happy to see quality posts reappear. I often glean new insights each time. If it's interesting to enough people, it'll bubble up again. The beauty of HN's "gravity" is that anything that's universally boring will disappear quickly.

"Anything that gratifies one's intellectual curiosity" needn't mean, "Hasn't been posted ever before."

If, however, a de-duplicator could automatically provide references to all previous HN discussions on the same topic, that'd be very cool.

4 comments

I may be mistaken but the issue the author is concerned with is de-duplicating breaking stories, not preventing the reposting of items from the long tail. I don't think he is trying to prevent reposting evergreen content.
Deduplicating the current stories was simple to implement as a first hack, which is why I built a prototype for it. But it is very easy to extend the idea to stories across time. All we need is to maintain an index of all Hacker News stories, and the same approach can be applied to prevent reposting old content too.
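To make the "index of all stories" idea concrete, here is a minimal sketch. The thread doesn't say how the prototype compares stories, so the Jaccard similarity on title words, the 0.6 threshold, and the names `StoryIndex`, `title_tokens`, and `jaccard` are all assumptions for illustration, not the prototype's actual method.

```python
import re


def title_tokens(title):
    """Lowercase a title and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class StoryIndex:
    """Hypothetical in-memory index of every story seen so far."""

    def __init__(self, threshold=0.6):
        self.threshold = threshold  # assumed cutoff; would need tuning on real data
        self.stories = []           # list of (story_id, token_set)

    def add(self, story_id, title):
        self.stories.append((story_id, title_tokens(title)))

    def find_duplicates(self, title):
        """Return ids of indexed stories whose titles are similar enough."""
        tokens = title_tokens(title)
        return [sid for sid, seen in self.stories
                if jaccard(tokens, seen) >= self.threshold]
```

A linear scan like this is fine for the current front page; an index covering every story ever posted would probably want an inverted index or similar so lookups don't touch every entry.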
Preventing the reposting of old content is undesirable. A story that was posted "2516 days ago" has, at most, zero value to the community today. If it prevents a posting today that would provoke meaningful dialog, the old story is likely detrimental. HN is a very different community than it was five years ago.
On the other hand, stories which would otherwise overwhelm the front page could automatically be merged into a single aggregate entry in one slot. As the community grows and diversifies, having another level of abstraction above "story" (which on other sites is taken up by "boards" or "tags") could be useful for organizing that content without disrupting the site too much.
One place a tool like this would be useful is on submission. If I post a link and something comes up that says "HEY! This was last posted 2 days ago and got 300 upvotes and had 45 comments, here's the link.", that would discourage reposting when the poster didn't know the article had been posted before.
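A rough sketch of what that submission-time check could look like. Everything here is hypothetical: `normalize_url`, `submission_notice`, the shape of the `previous_posts` lookup, and the choice to ignore `www.`, trailing slashes, and `utm_` parameters are assumptions, not a description of how HN handles submissions.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode


def normalize_url(url):
    """Reduce a URL to a comparison key: host + path + non-tracking query."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    # Drop utm_* tracking parameters and sort the rest for a stable key.
    params = sorted((k, v) for k, v in parse_qsl(parts.query)
                    if not k.startswith("utm_"))
    query = urlencode(params)
    path = parts.path.rstrip("/")
    return host + path + ("?" + query if query else "")


def submission_notice(url, previous_posts):
    """Return a warning string if the URL was posted before, else None.

    previous_posts maps normalized URLs to dicts with the keys
    days_ago, points, comments, and link (an assumed record layout).
    """
    post = previous_posts.get(normalize_url(url))
    if post is None:
        return None
    return ("This was last posted {days_ago} days ago and got {points} upvotes "
            "and had {comments} comments, here's the link: {link}").format(**post)
```

The notice only informs the submitter; whether to block the post or let it through is a separate policy decision.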
HN does this, but only when the link is the same...

You can have people post similar stories from different media outlets, and if they get picked up at different times then this could be useful.

I thought it already did that.
Oh, maybe. I guess I've never submitted a link before to know that. Even still, if that algorithm can be improved, even slightly, it'd be better for curbing repeat posts.
Perhaps the de-duplicator as described in the OP could operate on posts within the same X-hour period (where X is somewhere between 24 and 48). I agree with you that automatically referencing previous postings of the same content would be great for older duplicates.
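Something like the following would restrict de-duplication to an X-hour window. It is only a sketch: `dedupe_recent`, the 36-hour default, and the exact-title fallback comparison are assumptions, and a real version would plug in whatever similarity test the OP's prototype uses via `same_story`.

```python
from datetime import datetime, timedelta


def dedupe_recent(stories, window_hours=36, same_story=None):
    """Drop stories that match an earlier kept story posted within the window.

    stories: list of (posted_at, title) tuples, posted_at being a datetime.
    same_story: optional predicate taking two titles; defaults to a
    case-insensitive exact match (a placeholder for a real similarity test).
    """
    if same_story is None:
        same_story = lambda a, b: a.strip().lower() == b.strip().lower()
    window = timedelta(hours=window_hours)
    kept = []
    for posted_at, title in sorted(stories):  # oldest first
        is_dup = any(
            same_story(title, kept_title) and posted_at - kept_at <= window
            for kept_at, kept_title in kept
        )
        if not is_dup:
            kept.append((posted_at, title))
    return kept
```

Matches older than the window are kept, so evergreen reposts still get through; those are the ones that would get a link back to the earlier discussion instead.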