by tedmiston
2932 days ago
You should write one from scratch to get a deeper understanding of how hard it is to return highly relevant results quickly. Tokenizing, stemming, bag of words, and tf-idf for ranking get you to an MVP, but then you realize how good production-grade search engines are today. Solr is good. I've been wanting to try Lunr [1] for small sites.
[1]: https://github.com/olivernn/lunr.js
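For anyone curious what that MVP looks like, here is a minimal sketch of the pipeline named above: tokenize, stem, count terms per document, and rank with tf-idf. Everything in it is an assumption made for illustration: the names `tokenize`, `stem`, `analyze`, and `TinyIndex` are hypothetical, the stemmer is a crude suffix stripper standing in for a real one such as Porter, and the smoothed idf formula is just one common variant.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(token):
    """Crude suffix stripping; a placeholder for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Full analysis chain: tokenize, then stem each token."""
    return [stem(t) for t in tokenize(text)]


class TinyIndex:
    """Toy inverted index with tf-idf scoring (hypothetical, not production code)."""

    def __init__(self, docs):
        self.docs = docs
        self.postings = defaultdict(dict)  # term -> {doc_id: term count}
        for doc_id, text in enumerate(docs):
            for term, count in Counter(analyze(text)).items():
                self.postings[term][doc_id] = count

    def idf(self, term):
        # Smoothed inverse document frequency (one common variant).
        df = len(self.postings.get(term, {}))
        return math.log((1 + len(self.docs)) / (1 + df)) + 1.0

    def search(self, query, k=3):
        """Return up to k (doc_id, score) pairs, best first."""
        scores = defaultdict(float)
        for term in analyze(query):
            for doc_id, tf in self.postings.get(term, {}).items():
                scores[doc_id] += tf * self.idf(term)
        # Sort by descending score; break ties by doc id for stable output.
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


docs = [
    "Running shoes for trail running",
    "Leather dress shoes",
    "Trail mix snack",
]
index = TinyIndex(docs)
results = index.search("running shoes")
```

Even this toy shows where the hard parts start: the stemmer turns "running" into "runn", so a query for "runs" finds nothing, and there is no length normalization, phrase matching, or typo tolerance.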
We wrote our own search engine at that point. You are right that there are a lot of little “devil in the details” issues. But overall it was a fun experience.
This was needed to support some specific machine learning workflows in the search ranking process, which we could not have used if we first had to pay the high latency cost of getting preliminary results from Solr.
So we took a “create your own index data structures” approach to the index data (both the normalized bag-of-words vectors and companion data like boolean filters), which allowed us to heavily optimize the initial broad ranking query. Latency was low enough to leave room in the time budget for calling the follow-on machine learning services.
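To make that concrete, here is a rough sketch of what such a structure could look like. It is not our actual implementation; the names (`ProductIndex`, `IN_STOCK`, `FREE_SHIPPING`), the bitmask encoding of the boolean filters, and the whitespace tokenizer are all assumptions made for illustration. Each product stores an L2-normalized bag-of-words vector plus packed boolean flags, so the broad ranking query is a filtered cosine-similarity pass over in-memory postings.

```python
import math
from collections import Counter
from dataclasses import dataclass

# Hypothetical boolean filters, packed into bits of an integer.
IN_STOCK = 1 << 0
FREE_SHIPPING = 1 << 1


@dataclass
class Product:
    vector: dict  # term -> weight, L2-normalized
    flags: int    # bitmask of boolean filters


def normalize(counts):
    """Scale a term-count mapping to unit L2 length."""
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


class ProductIndex:
    """Toy in-memory index: postings of normalized weights plus filter bits."""

    def __init__(self):
        self.products = []
        self.postings = {}  # term -> list of (product_id, weight)

    def add(self, text, flags):
        vec = normalize(Counter(text.lower().split()))
        pid = len(self.products)
        self.products.append(Product(vec, flags))
        for term, weight in vec.items():
            self.postings.setdefault(term, []).append((pid, weight))
        return pid

    def search(self, query, required_flags=0, k=10):
        """Rank by cosine similarity, keeping only products with all required flags."""
        qvec = normalize(Counter(query.lower().split()))
        scores = {}
        for term, q_weight in qvec.items():
            for pid, weight in self.postings.get(term, ()):
                if (self.products[pid].flags & required_flags) != required_flags:
                    continue  # fails a boolean filter
                scores[pid] = scores.get(pid, 0.0) + q_weight * weight
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


index = ProductIndex()
a = index.add("red running shoes", IN_STOCK | FREE_SHIPPING)
b = index.add("red dress shoes", IN_STOCK)
c = index.add("blue running shorts", 0)
candidates = index.search("red shoes", required_flags=FREE_SHIPPING)
```

The point of the design is that filtering and scoring happen in one pass over data already in memory, and the top `k` candidates can then be handed to a slower re-ranking step.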
This was for a fairly high-traffic product search engine at an online retailer. It ended up working very well. Over a span of about two years we rolled all search traffic onto the in-house platform, even the parts that did not need the machine learning services. Query latencies went down across all our traffic, and we retired the original Solr implementation.
It wouldn’t be the right choice for everyone, but the experience strongly informs my opinion on whether it is worthwhile to build an in-house search engine specifically to replace Solr. I suspect a lot of medium-sized or large companies running Solr should seriously consider it.