by mlthoughts2018
2930 days ago
Really it’s not fancy or anything. We used Eigen to represent our normalized bag-of-words matrix (the term-document matrix) as a sparse matrix in CSC and CSR format, which means the data lives in three underlying arrays (the nonzero values, their inner indices, and the outer offsets), with indexing conventions for how to use them. Boolean and multi-choice indices are just companion arrays where position i holds a property of document i in the index: a boolean for binary attributes (for example, whether the item has free shipping or not), or a bigger integer space to encode more options, say an int8 coupled with helper functions that check which bit is set, for some set of 8 categories the items can be filtered by. The “index” is just the serialized arrays backing the sparse matrix, the arrays backing the filters, and the helper functions for decoding what the filter bits mean. A query is then just applying the filters, performing the sparse matrix inner product, and sorting by score. It’s very basic, but that lets you optimize it heavily for whatever matters to you: deletes, writes, certain heavily used filters, etc. And you can of course add whatever fancy NLP stuff on top of, or in place of, the sparse matrix as well.
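To make the shape of that concrete, here is a minimal sketch of the filter-then-score-then-sort query. It is not our actual code: it hand-rolls the CSR arrays with `std::vector` instead of using Eigen, it assumes documents are the rows, and `CsrIndex`, `search`, `kFreeShipping` and `kInStock` are hypothetical names made up for the illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical filter layout: one byte per document, one bit per attribute.
constexpr std::uint8_t kFreeShipping = 1u << 0;
constexpr std::uint8_t kInStock = 1u << 1;

// Sparse matrix in CSR form, with documents as rows (an assumption of this
// sketch): row d holds the nonzero, normalized term weights of document d.
struct CsrIndex {
    std::vector<std::size_t> row_ptr;   // num_docs + 1 offsets into col_idx/values
    std::vector<std::size_t> col_idx;   // term id of each nonzero entry
    std::vector<double> values;         // weight of each nonzero entry
    std::vector<std::uint8_t> filters;  // companion array: filters[d] = bits of doc d

    std::size_t num_docs() const { return filters.size(); }
};

// Returns (document id, score) pairs, best score first, for documents that
// have every bit in `required` set and a positive inner product with `query`.
// `query` is a dense vector indexed by term id.
std::vector<std::pair<std::size_t, double>> search(
    const CsrIndex& index, const std::vector<double>& query,
    std::uint8_t required) {
    assert(index.row_ptr.size() == index.num_docs() + 1);

    std::vector<std::pair<std::size_t, double>> hits;
    for (std::size_t d = 0; d < index.num_docs(); ++d) {
        // Step 1: apply the filters before doing any arithmetic.
        if ((index.filters[d] & required) != required) continue;

        // Step 2: sparse inner product of row d with the query.
        double score = 0.0;
        for (std::size_t k = index.row_ptr[d]; k < index.row_ptr[d + 1]; ++k) {
            score += index.values[k] * query[index.col_idx[k]];
        }
        if (score > 0.0) hits.emplace_back(d, score);
    }

    // Step 3: sort by descending score.
    std::sort(hits.begin(), hits.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    return hits;
}
```

Calling `search(index, query, kFreeShipping)` only ever touches rows whose filter byte has the free-shipping bit set, which is why cheap companion arrays pay off: documents that fail the filter cost one bitwise AND and nothing else. A real version would probably keep a top-k heap instead of sorting everything, and might use a sparse query vector.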