by estebarb 1515 days ago

There is a lot of public literature on how to build a search engine: it is not that secret. You must crawl millions of websites, then process, index, and deduplicate them, remove spam... And then you must create a frontend client that queries the indexes and returns the most relevant web pages very fast. It is "just" that.
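To make one of those steps concrete, here is a minimal sketch of the deduplication stage using word shingles and Jaccard similarity. The function names, the shingle size and the 0.8 threshold are all assumptions picked for illustration, not how any particular engine does it.

```python
import re

def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) of a text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def deduplicate(pages, threshold=0.8):
    """Keep the first page of each near-duplicate group.

    pages: dict mapping URL -> page text (hypothetical input format).
    threshold: assumed similarity cutoff above which pages count as duplicates.
    """
    kept = {}  # URL -> shingle set of every page kept so far
    for url, text in pages.items():
        sig = shingles(text)
        if all(jaccard(sig, other) < threshold for other in kept.values()):
            kept[url] = sig
    return list(kept)
```

This compares every page against every kept page, so it only works for a toy corpus. At web scale you would more likely reach for something like MinHash with locality-sensitive hashing, which is exactly the kind of gap the rest of this comment is about.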

You can build a toy search engine easily... in fact, it is a popular project in Information Retrieval courses at universities around the world. But scaling that toy to something truly web scale requires vast amounts of compute resources, money, and time to debug.
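For a sense of how small the toy version is, here is a rough sketch of an in-memory inverted index with a simple TF-IDF score. The class name `ToyIndex` and the exact scoring formula are my own choices for illustration; a real course project would probably use BM25 or similar.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

class ToyIndex:
    """Hypothetical in-memory inverted index, for illustration only."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.doc_count = 0

    def add(self, doc_id, text):
        """Index one document under the given id."""
        self.doc_count += 1
        for term, tf in Counter(tokenize(text)).items():
            self.postings[term][doc_id] = tf

    def search(self, query, k=10):
        """Return up to k doc ids ranked by summed tf * idf over query terms."""
        scores = defaultdict(float)
        for term in tokenize(query):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            # Assumed idf variant: rarer terms get a larger weight.
            idf = math.log(1 + self.doc_count / len(docs))
            for doc_id, tf in docs.items():
                scores[doc_id] += tf * idf
        return sorted(scores, key=scores.get, reverse=True)[:k]
```

Usage is just `idx = ToyIndex()`, a few `idx.add("d1", "some text")` calls, then `idx.search("some query")`. Everything lives in one process and one dict, which is precisely what stops working at web scale.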

Also, swapping an "algorithm" is not easy: it requires changing the indexes (postings files vs fast nearest-neighbour queries for embeddings? in memory? on disk for long-tail queries?), the compute infrastructure (single node? MapReduce? graph processing like Pregel? something Deep Learning? are we building a knowledge graph?), and which languages it will support (not all languages have the same resources).
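To illustrate why the index has to change, here is a rough sketch of the embedding side: a brute-force nearest-neighbour lookup by cosine similarity. The vectors are made-up 3-dimensional values, not output from any real model, and `nearest` is a hypothetical helper name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(query_vec, doc_vecs, k=2):
    """Brute-force nearest neighbours: scores every document vector."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical embeddings, invented purely for illustration.
doc_vecs = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.0, 1.0, 0.2],
    "d3": [0.8, 0.3, 0.1],
}
```

A postings file only touches documents that contain a query term, while this scans every vector on every query. Making it fast means an approximate nearest-neighbour structure, which is a different index with different memory and disk trade-offs, so the "algorithm" swap drags the storage layer along with it.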

But there are open source components that could be leveraged to build a search engine: Apache Nutch + Apache Hadoop + Elasticsearch + TensorFlow + ...