|
Superficially it is not that hard. If you can spin up another server, you can run Elasticsearch, stuff it with content through its JSON API, and run searches through the same API. Getting a search engine that satisfies people, however, takes more work, particularly work that is hard to productize. For instance, if you crawl a web site you can run into "web traps" that generate an infinite number of pages, and you need to strip the boilerplate at the top, sides and bottom of each page. People also have expectations for search based on Google, and that is a whole ball of wax. Searching the whole web might, in some sense, be an easier problem than searching an individual web site: it is much more a matter of winnowing out the junk than of trying to save a handful of relevant results from being lost. (Actually, Google site search is not that good.) The mainstream of work in relevance has been around problems similar to patent search, where it is important that result #5000 be (relatively) good. Google is focused on result #1 being good, since in many search applications people won't scroll past the first page. Often people who deploy search solutions decide that they "suck" and give up. If you want search that is useful, never mind "world class", you need a lot of customization done by people who know the dark arts. |
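
To make the "web trap" point concrete, here is a rough sketch of the kind of guard a site crawler might apply before fetching a URL. Everything in it is an assumption for illustration: the names `looks_like_trap` and `CrawlBudget` are made up, and the thresholds are arbitrary and would need tuning per site. It only uses the Python standard library and is not how any particular crawler does it.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

# All limits below are illustrative guesses, not recommended values.
MAX_DEPTH = 8                # max number of path segments
MAX_REPEATS = 2              # max times one segment may repeat in a path
MAX_QUERY_PARAMS = 4         # max number of query parameters
MAX_PAGES_PER_HOST = 10_000  # overall crawl budget per host


def looks_like_trap(url: str) -> bool:
    """Heuristically flag URLs that are typical of crawler traps."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) > MAX_DEPTH:
        return True
    # Paths such as /a/b/a/b/a/b usually come from broken relative links.
    if segments and max(Counter(segments).values()) > MAX_REPEATS:
        return True
    # Calendars and faceted navigation tend to pile up query parameters.
    if len(parse_qsl(parts.query, keep_blank_values=True)) > MAX_QUERY_PARAMS:
        return True
    return False


class CrawlBudget:
    """Remembers seen URLs and refuses new ones once a host's budget is spent."""

    def __init__(self, max_pages_per_host: int = MAX_PAGES_PER_HOST):
        self.max_pages_per_host = max_pages_per_host
        self.seen = set()
        self.per_host = Counter()

    def should_fetch(self, url: str) -> bool:
        host = urlsplit(url).netloc.lower()
        if url in self.seen or looks_like_trap(url):
            return False
        if self.per_host[host] >= self.max_pages_per_host:
            return False
        self.seen.add(url)
        self.per_host[host] += 1
        return True
```

The hard page budget is the real backstop here: heuristics like depth and repeated segments miss plenty of traps, but a cap per host at least keeps the crawl finite.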