
by aantix 2530 days ago

The Common Crawl corpus is already available on S3, so anyone with an AWS account and a simple MapReduce job can analyze billions of web pages today.
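To give a rough idea of what "a simple MapReduce job" means here, this is a minimal in-memory sketch of the map and reduce steps. It does not touch S3 or the real Common Crawl data: `SAMPLE_PAGES`, `map_page`, `reduce_counts`, and `words_per_host` are hypothetical names made up for illustration, and a real job would read crawl archive records instead of a hard-coded list.

```python
from collections import Counter
from functools import reduce
from urllib.parse import urlparse

# Hypothetical stand-in for crawl records: (url, page_text) pairs.
# A real job would stream these from crawl archive files instead.
SAMPLE_PAGES = [
    ("https://example.com/a", "search engines index pages"),
    ("https://example.com/b", "pages link to pages"),
    ("https://example.org/", "open crawl data"),
]


def map_page(record):
    """Map step: emit a partial count of words seen for this page's host."""
    url, text = record
    host = urlparse(url).netloc
    return Counter({host: len(text.split())})


def reduce_counts(left, right):
    """Reduce step: merge two partial per-host counters into one."""
    return left + right


def words_per_host(pages):
    """Run the map step over every page, then fold the results together."""
    return reduce(reduce_counts, map(map_page, pages), Counter())


totals = words_per_host(SAMPLE_PAGES)
for host, count in sorted(totals.items()):
    print(host, count)
```

Because the map step works on one page at a time and the reduce step only merges partial counts, the same shape can be spread over many machines, which is what makes the approach practical at crawl scale.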

I'd actually advocate for making public an anonymized list of actual search queries.

Domain-specific search engines could evolve based on the demand shown by what has already been searched for.

1 comment

Sander_Marechal 2530 days ago

Anonymizing search queries is extremely hard, if not impossible. See https://en.wikipedia.org/wiki/AOL_search_data_leak for an example.
