by slashdev 2534 days ago
Storage and bandwidth are cheaper than ever before. People scrape a billion pages for much more mundane purposes these days, even for academic papers.

Having a full text index on that is more involved but hardly impossible. You're completely right that it's not at all Google's secret sauce. Bing has clearly indexed much more than that, plus invested a ton in actually returning good results from their index. And still nearly nobody cares. It's just not easy to make a better Google, and the people most likely to figure out how to do that already work there.


The Common Crawl corpus is already available and stored on S3, so analyzing billions of web pages takes little more than an AWS account and a simple MapReduce job.
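
To make "a simple MapReduce job" concrete, here is a minimal in-memory sketch of the map and reduce steps, counting pages per host. It is only an illustration: `SAMPLE_URLS` is a hypothetical stand-in for URLs you would read from the crawl data, the function names are made up, and a real job would run distributed over the files on S3 rather than over a Python list.

```python
from collections import Counter
from functools import reduce
from urllib.parse import urlsplit

# Hypothetical sample standing in for URLs read from crawl records.
SAMPLE_URLS = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.org/",
    "http://news.example.net/story?id=1",
]


def map_host(url):
    """Map step: emit a (host, 1) pair for one crawled URL."""
    host = urlsplit(url).hostname or ""
    return (host, 1)


def reduce_counts(acc, pair):
    """Reduce step: fold one (host, count) pair into the running totals."""
    host, count = pair
    acc[host] += count
    return acc


def pages_per_host(urls):
    """Run the map step over every URL, then reduce into per-host counts."""
    return reduce(reduce_counts, map(map_host, urls), Counter())


if __name__ == "__main__":
    for host, n in pages_per_host(SAMPLE_URLS).most_common():
        print(host, n)
```

The same two functions carry over to a distributed framework more or less unchanged; only the input source and the shuffle between the map and reduce steps differ.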

I'd actually advocate for making public an anonymized list of actual search queries.

Domain-specific search engines could evolve based on the demand shown by what has already been searched for.

Anonymizing search queries is extremely hard, if not impossible. See https://en.wikipedia.org/wiki/AOL_search_data_leak for example.
> It's just not easy to make a better Google

It depends which sense of "better" you mean. It's nearly trivial to make an ethically superior search engine by just not building the spyware bits of Google.

It's difficult to make a search engine that's "better" along the dimensions of speed, profitability, etc.

That exists: it's called DuckDuckGo, and even fewer people care about it than Bing. For the most part, people don't actually care about Google collecting their entire search history and combining it with its other data on them. We may live to regret that in a hypothetical future where the government turns more authoritarian and requisitions that data for evil.
I made three statements. They're all true as far as I can see. Would the downvoters care to speak up?