by aantix 2530 days ago
The Common Crawl corpus is already available and stored on S3, so analyzing billions of web pages is already possible with an AWS account and a simple MapReduce job.

I'd actually advocate for making public an anonymized list of actual search queries.

Domain-specific search engines could evolve based on the demand shown by what has already been searched for.

1 comment

Anonymizing search queries is extremely hard, if not impossible. See https://en.wikipedia.org/wiki/AOL_search_data_leak for example.