Hacker News new | ask | show | jobs
by h0l0cube 1232 days ago
Why would it be more complicated? A simple dot product would have excellent scaling characteristics
2 comments

I designed the ranking algorithm for our product search. There are several factors in it that are nonlinear.

For example, we obviously want the most-viewed and most-bought products to rise to the top. However, we also expect there's an initial honeymoon period for many new products, where people want to see them but they don't have enough history yet to sway the popularity factor. So there's a non-linear term that looks kind of like

  ( weight_factor / (product_age_days * age_scale_constant) )
At a glance it seems like there works be a way to form the weighting adjustment into a subquery and multiply by the reciprocal. And then it’s still multiply-accumulate, but maybe I’m missing something.
Isn't that just saying "delegate the nonlinear part out to a black box"?
At a glance (1 / (product_age_days * age_scale_constant)) just seems like another factor you can multiply (with the calculation of reciprocal costing more in compute time). Again, I'm more than likely misunderstanding you or lacking the context.
When it comes to ML scalability is a constraint not a goal. The goal is to minimize some loss function and it turns out simple dot product can be outperformed by more complex algorithms.

I remember reading a few years ago that most search engines use some tree based model. If that's the case, that means the idea of monotonic linear weights is not relevant.

Can you be more specific? Dot product is about as performant as it gets with linear memory access and SIMD multiply accumulate. Throw random memory access and flow control in there and it’s a struggle to do it faster. Unless the factors are sparse, in which case just elide the zero values.
> scalability is a constraint not a goal. The goal is to minimize some loss function
My bad. I was under the impression that most search engines are compute bound, but if anything there’s probably a glut of compute for such applications and a market appetite for better results.
Also, I'd assume it's highly time-agnostic (i.e. content change timespan : compute availability timespan).

So you can run your bulk-recomputing whenever you have spare capacity.

Stale rankings aren't great, but don't hurt that much. As long as your liveness is more frequently updated, so you don't send people to dead sites.

Certainly caching is important, especially for Word2Vec or other NLP which you'd want to happen in a separate stage after crawl, but as someone mentioned in a sibling comment, there are some factors that are calculated per-query, which can have a lot of cache misses for novel queries.