| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by tedmiston 2932 days ago

You should write one from scratch to get a deeper understanding of how hard it is to return highly relevant results quickly. Tokenizing, stemming, bag of words, and tf-idf for ranking get you to an MVP, but then you realize how good production grade search engines are today.

Solr is good. I've been wanting to try Lunr [1] for small sites.

[1]: https://github.com/olivernn/lunr.js

1 comments

mlthoughts2018 2932 days ago

I worked in a company previously where Solr was used to scale the business, and was not performant for us after a while.

We wrote our own search engine at that point. You are right that there are a lot of little “devil in the details” issues. But overall it was a fun experience.

This was needed to support some specific machine learning workflows in the search ranking process — which could not be used if we paid the high latency cost to first get preliminary results in Solr.

So we took a “create your own index data structures” approach with index data (both the normalized bag of words vectors and companion data like boolean filters), which allowed us to highly optimize the initial broad ranking query. Latency was low enough that it allowed the time cost of calling follow-on machine learning services.

This was for a fairly high-traffic product search engine at an online retailer. It ended up working very well and over a span of about two years we eventually rolled all search traffic onto the in-house platform, even the parts not needing the machine learning services, and our query latencies went down across all our traffic, and we retired the original Solr implementation.

Wouldn’t be the right choice for everyone, but it informs my opinion a lot about the worthwhileness of creating an in-house search engine to specifically replace Solr. I’d suspect a lot of medium-sized or large companies running Solr should seriously consider it.

link

tedmiston 2932 days ago

I realize my comment was a little vague around Solr — my experience is specifically more with Lucene than Solr itself, and I just think of Solr as an easier entry point to Lucene. Did you consider or end building your custom search engine on top of Lucene or something else? What about using a service like Algolia if something like that were available at the time?

link

sova 2932 days ago

Could you tell us more about the kinds of indices you used, and what you mean by boolean filters? Thank you!

link

mlthoughts2018 2932 days ago

Really it’s not fancy or anything. We used Eigen to represent our normalized bag of words matrix (term-document matrix) as a sparse matrix in CSC and CSR format (which means the data resides in three underlying arrays for the nonzero entries, with indexing conventions for how to use them).

Boolean & multi-choice indices are just companion arrays where position i corresponds to a property of document i in the index: boolean for binary attributes (for example, whether the item has free shipping or not), or using a bigger integer space to encode more options, like say an int8 coupled with helper functions that check which bit is set, maybe for some set of 8 categories the items can be filtered by).

The “index” is just the serialized arrays backing the sparse matrix, the arrays backing the filters, and helper functions for decoding what the filter bits mean.

A query is then just applying the filters followed by performing the sparse matrix inner product and sorting.

It’s very basic, but allows you to heavily optimize it, whether optimizing for deletes, writes, certain heavily used filters, etc.

And you can of course add whatever fancy NLP stuff on top of or in place of the sparse matrix as well.

link

sova 2932 days ago

It sounds like a great approach to enable both speed in searching and customizability in indexing -- no way you get to bit level box ticking with conventional means. Thanks for explaining. Bag of Words, last I recall reading, is a statistical method for predicting the next word, so I'm curious how that played out for you.

link

mlthoughts2018 2932 days ago

Bag of words is an encoding method for converting some sequence of words into a long, sparse vector of word counts. The vector has one component for each possible word of the vocabulary, and if the count of word i is N, then you store N in conponent i of the bag of words vector.

If you think of these as sparse row vectors (the columns correspond to all vocabulary entries), then you store them as a matrix where you stack on another row for each “document” in your data set.

Later on when you get a new “document” at query time, you transform it into the same bag of words vector format, and then an inner product between the matrix and the query vector corresponds to a type of relevance / similarity useful for sorting into a ranked order of results.

In practical situations you have to work harder, because you need more units of text than just words (such as n-grams), and the raw term counts usually need to be weighted (e.g. matching a 3-gram probably means more than matching a single word) or normalized (e.g. longer documents happen to have more words, but that doesn’t mean they are more similar), and you need to account for results that are historically more popular or results that are newer.

It’s a very simple approach to document search, but it works well and there are extensions that utilize word embeddings or models that predict rankings of results.

Once you get a system running with the term-document matrix, it is a nice platform for more advanced experimentation and machine learning feature development.

link