by arturventura 1401 days ago
I'm working on building an AWS for anyone who wants to make their own search engine. The idea is to have a single open web index database, continuously updated, that you can apply ranking and embedding algorithms to. This would reduce the cost of entry and enable developers to build competitors to Google on top of it, or create new products in the search space, like a search engine for clothes. I don't know if this is interesting to anyone, but if it is, hit me up.
4 comments

That sounds very cool, and I hope you (and your customers!) are successful. Out of curiosity, did you find an existing market need for that, or is it a "build it and they will come" model?

Also, have you thought about partnering with commoncrawl.org? I could see that relationship benefiting both sides: they get fresher indices, you get access to the historical web snapshots.

I faced the problem myself. I think one of the main issues with Google is the modality of the results. Google is forced to return a list of links because that's the main vehicle through which they drive profit. If you were to send a question like "Who is Barack Obama?" you will still get a list of links, although Google knows there is a canonical answer.

The problem is that if you were to build a new search engine from the ground up, it would take millions in infrastructure and a lot of time just to test one idea. There are multiple attack vectors on Google's business model (privacy, subscription model, modality, etc.), but you might only get the chance to test one of them, and if that fails, starting again is super expensive, so you might not be able to get the funds to do it.

My approach then became to build something that others can build on top of.

I'm currently using Common Crawl, but my main problem is that I need to build a small toy to test it, and even processing Common Crawl is crazy expensive. Just a single snapshot is 150 TB, so this needs to be processed on bare metal, or you're gonna pay a hefty AWS bill.
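
To put a rough number on that, here is a back-of-the-envelope sketch of how long a single read pass over one snapshot takes. The 150 TB figure is the one quoted above; the 100 MB/s per-worker throughput, the worker counts, and the `hours_to_scan` helper are assumptions made up for illustration, not measurements.

```python
# Back-of-the-envelope: how long does one read pass over a snapshot take?
# Throughput and worker counts below are illustrative assumptions.

SNAPSHOT_TB = 150  # snapshot size quoted in the comment above


def hours_to_scan(size_tb: float, mb_per_sec_per_worker: float, workers: int) -> float:
    """Hours needed to read size_tb once at the given aggregate throughput."""
    total_mb = size_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    seconds = total_mb / (mb_per_sec_per_worker * workers)
    return seconds / 3600


if __name__ == "__main__":
    for workers in (1, 10, 100):
        hours = hours_to_scan(SNAPSHOT_TB, mb_per_sec_per_worker=100, workers=workers)
        print(f"{workers:>3} workers at 100 MB/s each: {hours:,.1f} hours")
```

This only counts reading the bytes once. Parsing, indexing, and any re-runs while you iterate on an idea come on top of it, which is where the cost really adds up.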

> If you were to send a question like "Who is Barack Obama?" you will still get a list of links, although Google knows there is a canonical answer.

For that specific search I would start at Wikipedia, but for more general "data search" I lean towards Wolfram Alpha, which has some usability issues but an interesting maths engine for queries. https://www.wolframalpha.com/input?i=Barack+Obama+vs+Donald+...

It sounds like just what we need to break free from Google.

I’ve been dreaming of an open web index and social graph for more than a decade.

Any one company holding the data + the algorithm + the presentation layer has way too much power. We can and should split that problem into its separate domains.

I hope you succeed, keep us posted.

Have you seen Common Crawl? https://commoncrawl.org/. If so, what differences do you imagine for yours?
> continuously updated

is what I saw as the primary difference. Whether that's going to pan out in reality as well as it does in HN comments is a case of "the devil's in the details," though.

Wouldn't it be prohibitively expensive for you to crawl and index the web?