Hacker News
by rcardo11 2159 days ago
Search engines are such a big thing they should be open sourced and distributed over the community. It's like the most basic infrastructure the internet needs to work and we are outsourcing it.
2 comments

Please describe in detail how you will distribute crawl, index, and ranking as basic infrastructure.
https://commoncrawl.org/

Obviously, not at the same level as Google, and there are other parts to solve. But I believe we can do this together if we try. People were talking about building their own search engine on the Elixir forums a while ago, and many seemed interested.

The same way you decentralize anything else.

You can do the crawling with a browser extension that opens a new tab, crawls the data at your current URL, and sends it up to the mothership.

You can actually do even better because you don't get SEO-hacks like disabling certain javascript when Google is on the page to improve speed.

I was thinking exactly this when I stumbled upon your comment, except I figured it should work for any private tab and it'd also need a browser that makes tabs private (and contained) by default.

It's a problem more easily solved by VC-backed companies or government laws, because we're not going to see Google do that in this lifetime, while FOSS solutions simply won't get the needed traction.

What happens when these self-hosted crawlers access illegal content in one's country?
The same thing that happens when a peer accesses an illegal torrent in their country? How is this relevant? It is a decentralized system, it shouldn't make a difference.
"Honestly, officer, I didn't click on that link to CA imagery. It was my webcrawler."
> do even better

You just traded SEO as we know it for a scheme in which any rando can just upload the supposed contents of any URL.

So, couldn't you keep a database? Many people would upload the same URL, and whichever copies are bogus would get a low score, like shadowbanning. Say, use a DHT with proof-of-work and things should work? Obviously I'm oversimplifying, but I see it as a problem already solved by using a blockchain.

Also, ethically speaking, aren't we at the point of considering the idea of trusting random people smarter than trusting huge corporations whose only goals are to make more and more money?

How do you know which copies are bogus? It can't be just by saying that the one you have the most copies of is the right one. The problem is that most legit copies will be subtly different, while an attacker trying to forge page contents can make their copies identical. You can't do fuzzy matching when deciding what to store, since that would require all the nodes to agree on the fuzzy matching algorithm. That's going to mean hard-coding a complex algorithm that requires constant updates into your blockchain infra.
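To make the failure mode concrete, here's a rough Python sketch of naive "most identical copies wins" voting. Everything in it is made up for illustration: the `majority_copy` helper, the page contents, and the idea that honest crawls differ only by an embedded timestamp. It's not how any real system does it, just the simplest version of the scheme being discussed.

```python
import hashlib
from collections import Counter

def majority_copy(submissions):
    """Return (content, votes) for the submission whose exact SHA-256
    digest occurs most often. Hypothetical helper, for illustration only."""
    digests = [hashlib.sha256(s.encode("utf-8")).hexdigest() for s in submissions]
    winner_digest, votes = Counter(digests).most_common(1)[0]
    return submissions[digests.index(winner_digest)], votes

# Assumption: five honest crawls of the same page, each slightly different
# because the server embeds a per-request timestamp.
honest = [
    f"<html><body>Real article <span>served at 12:00:0{i}</span></body></html>"
    for i in range(5)
]

# Assumption: one attacker submitting three byte-identical forged copies.
forged = ["<html><body>Buy my pills</body></html>"] * 3

winner, votes = majority_copy(honest + forged)
print(votes, winner == forged[0])
```

The five honest copies each hash to a different digest, so they get one vote apiece, and the three identical forgeries win even though they're outnumbered.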

A proof of work does not seem viable either. You're asking for the submitters to pass it for no reward, so the difficulty factor can't be particularly high. But then it becomes useless at blocking somebody who is actually deriving a benefit from submitting (fake) results.
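For reference, this is roughly what a hashcash-style proof of work looks like, as a minimal Python sketch. The function names, the 8-byte nonce encoding, and the 12-bit difficulty are all my own assumptions, not part of any proposal in this thread.

```python
import hashlib

def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def pow_digest(payload: bytes, nonce: int) -> bytes:
    # Assumed encoding: payload followed by an 8-byte big-endian nonce.
    return hashlib.sha256(payload + nonce.to_bytes(8, "big")).digest()

def solve(payload: bytes, difficulty_bits: int) -> int:
    """Brute-force a nonce; expected cost is about 2**difficulty_bits hashes."""
    nonce = 0
    while leading_zero_bits(pow_digest(payload, nonce)) < difficulty_bits:
        nonce += 1
    return nonce

def verify(payload: bytes, nonce: int, difficulty_bits: int) -> bool:
    """Checking a submission costs a single hash."""
    return leading_zero_bits(pow_digest(payload, nonce)) >= difficulty_bits

# Hypothetical submission: a URL plus the digest of the crawled content.
payload = b"https://example.com/|" + hashlib.sha256(b"page body").digest()
nonce = solve(payload, 12)  # low difficulty so a volunteer's browser can afford it
print(verify(payload, nonce, 12))
```

Each extra bit of difficulty doubles the expected work for everyone, so whatever setting is cheap enough for unpaid volunteers is also cheap for a spammer who profits from each fake submission.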

The giant company will in this case build an index that's far superior. The crowd-sourced version will have huge amounts of duplication of popular pages, and massive underrepresentation of the long tail. And can you imagine how inefficient the distributed version will be, both on storage and bandwidth? There can't be any facility for scheduling pages to be crawled at sensible intervals given the push model. The indexing nodes will just be flooded with pages they didn't actually want.

The crowd-sourced version will also not be "random people" like you suggested. A lot of them will have an agenda, and will be trying to manipulate the index to meet that agenda. And manipulate it in a way that's not useful to the people making searches. At least the company's goal of making money is furthered by building as useful an index as they can given the resource constraints.

Here's the way you do it...

The search engine page can be used for validation: just let people who press the back button and return to the results page tell you whether the results were useful or not.

Sounds like a lovely PhD dissertation!
That sounds a bit like yacy or searx:

https://yacy.net/ https://searx.me/