Hacker News
by stelonix 2159 days ago
So, couldn't you keep a database? Many people would upload the same URL, and whichever copies are bogus would get a low score, like shadowbanning. Say, use a DHT with proof-of-work and things should work? Obviously I'm oversimplifying, but I see it as a solved problem by using a blockchain.

Also, ethically speaking, aren't we at the point of considering the idea of trusting random people smarter than trusting huge corporations whose only goals are to make more and more money?

1 comments

How do you know which copies are bogus? It can't be just by saying that the one you have the most copies of is the right one. The problem is that most legit copies will be subtly different, while an attacker trying to forge page contents can make their copies identical. You can't do fuzzy matching when deciding what to store, since that would require all the nodes to agree on the fuzzy matching algorithm. That's going to mean hard-coding a complex algorithm that requires constant updates into your blockchain infra.
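
A minimal sketch of that failure mode, assuming the network picks the winning copy by counting exact content hashes. The function name `majority_copy` and the sample pages are made up for illustration and don't come from any real system:

```python
import hashlib
from collections import Counter

def majority_copy(copies):
    """Return (most common copy, vote count), grouping copies by exact SHA-256 digest."""
    digests = [hashlib.sha256(c.encode("utf-8")).hexdigest() for c in copies]
    winner, votes = Counter(digests).most_common(1)[0]
    return copies[digests.index(winner)], votes

# Five honest crawlers fetch the same page, but each one sees a slightly
# different dynamic fragment (hypothetical render timestamp).
honest = [f"<p>Real article</p><!-- rendered at 12:00:0{i} -->" for i in range(5)]

# One attacker submits three byte-identical forged copies.
forged = ["<p>Forged article</p>"] * 3

winner, votes = majority_copy(honest + forged)
print(winner, votes)
```

Even though the honest submitters outnumber the attacker five to three, each honest copy hashes differently and gets one vote, so the forged copy wins.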

A proof of work does not seem viable either. You're asking for the submitters to pass it for no reward, so the difficulty factor can't be particularly high. But then it becomes useless at blocking somebody who is actually deriving a benefit from submitting (fake) results.
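
To put rough numbers on that, here is a hashcash-style sketch, assuming SHA-256 and a "leading zero bits" difficulty rule. The payload format, the example URL, and the function names are assumptions for illustration only:

```python
import hashlib
from itertools import count

def verify(payload: str, nonce: int, difficulty_bits: int) -> bool:
    """Check that SHA-256(payload:nonce) has at least difficulty_bits leading zero bits."""
    digest = hashlib.sha256(f"{payload}:{nonce}".encode("utf-8")).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))

def solve(payload: str, difficulty_bits: int) -> int:
    """Brute-force the smallest nonce that passes verify()."""
    for nonce in count():
        if verify(payload, nonce, difficulty_bits):
            return nonce

def expected_hashes(difficulty_bits: int) -> int:
    """Average number of hash attempts per submission at this difficulty."""
    return 2 ** difficulty_bits

# Hypothetical submission: a URL an unpaid volunteer wants to contribute.
nonce = solve("https://example.com/page", 12)
print(nonce, expected_hashes(12))
```

The cost per submission is the same for everyone, about `2 ** difficulty_bits` hashes. A difficulty low enough that volunteers tolerate it is, by the same arithmetic, cheap for someone who profits from each forged submission.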

The giant company will in this case build an index that's far superior. The crowd-sourced version will have huge amounts of duplication of popular pages, and massive underrepresentation of the long tail. And can you imagine how inefficient the distributed version will be on both storage and bandwidth? There can't be any facility for scheduling pages to be crawled at sensible intervals given the push model. The indexing nodes will just be flooded with pages they didn't actually want.

The crowd-sourced version will also not be "random people" like you suggested. A lot of them will have an agenda, and will be trying to manipulate the index to meet that agenda. And manipulate it in a way that's not useful to the people making searches. At least the company's goal of making money is furthered by building as useful an index as they can given the resource constraints.

Here's the way you do it...

The search engine page can be used for validation: just let people pressing the back button on a result tell you whether the results were useful or not.

What was being proposed was a way of decentralising the crawling. I tried to demonstrate with some examples why that could not work: you'd end up with an extremely inefficient index. What you're proposing does not solve any of those problems. Sure, you'll get a weak signal about page quality, but far too late in the pipeline to inform the decentralized crawling and indexing.

But further, you are not really thinking through how one would abuse this kind of feature. If I were doing SEO, I wouldn't forge a page to have content that makes it get returned for irrelevant searches. Instead I would forge some high quality pages to show up as having backlinks to my page, and boost its PageRank. Or to demote the pages of people I dislike, I'd forge them to have content that makes them not show up in any searches. Your heuristic would not work there: if there are no clicks in the first place, there can't be any bounces.
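
A toy sketch of the backlink attack, assuming a simplified PageRank (plain power iteration, damping 0.85, every page has at least one outgoing link). The four-page graph and the page names are invented for illustration:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank; `links` maps each page to its list of outgoing links.

    Assumes every page has at least one outgoing link and that every
    link target is also a key in `links`.
    """
    n = len(links)
    rank = {page: 1.0 / n for page in links}
    for _ in range(iterations):
        new_rank = {page: (1 - damping) / n for page in links}
        for src, outs in links.items():
            share = damping * rank[src] / len(outs)
            for dst in outs:
                new_rank[dst] += share
        rank = new_rank
    return rank

# What the pages really link to: nobody links to T (the attacker's page).
honest_graph = {"A": ["B", "C"], "B": ["A"], "C": ["A"], "T": ["A"]}

# What the index sees after forged copies of A and B are submitted,
# each with an extra backlink to T.
forged_graph = {"A": ["B", "C", "T"], "B": ["A", "T"], "C": ["A"], "T": ["A"]}

honest = pagerank(honest_graph)
forged = pagerank(forged_graph)
print(honest["T"], forged["T"])
```

The forged copies of A and B never appear in results for the attacker's queries, so nobody clicks them and bounces off. T's rank still goes up, and a back-button signal has nothing to measure.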