| HN Mirror


by thrwyndx 1239 days ago

You throw them into a machine learning model together with a big dataset of queries/URLs annotated by humans for relevance. CatBoost is Yandex's choice of model here.
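A rough sketch of what "features plus human-annotated queries/URLs" could look like as training data. This does not use CatBoost: a tiny pairwise linear ranker stands in for the gradient-boosted trees so the snippet runs on the standard library alone. The feature names, URLs, labels, and the `train_pairwise`/`score` helpers are all made up for illustration.

```python
import math
from collections import defaultdict

# Hypothetical human judgments: (query, url, features, relevance label).
# Features are [text_match, click_rate]; a real system would use far more.
judgments = [
    ("catboost docs", "a.example/docs", [0.9, 0.8], 3),
    ("catboost docs", "a.example/blog", [0.4, 0.3], 1),
    ("catboost docs", "a.example/spam", [0.1, 0.2], 0),
    ("learning to rank", "b.example/paper", [0.8, 0.6], 2),
    ("learning to rank", "b.example/junk", [0.2, 0.1], 0),
]


def train_pairwise(rows, epochs=200, lr=0.5):
    """Fit linear weights so better-labeled docs outscore worse ones per query."""
    by_query = defaultdict(list)
    for query, _url, features, label in rows:
        by_query[query].append((features, label))

    weights = [0.0] * len(rows[0][2])
    for _ in range(epochs):
        for docs in by_query.values():
            for better, better_label in docs:
                for worse, worse_label in docs:
                    if better_label <= worse_label:
                        continue  # only pairs where `better` is labeled higher
                    diff = [b - w for b, w in zip(better, worse)]
                    margin = sum(wt * d for wt, d in zip(weights, diff))
                    # Gradient step on the pairwise logistic loss
                    # log(1 + exp(-margin)).
                    step = lr / (1.0 + math.exp(margin))
                    weights = [wt + step * d for wt, d in zip(weights, diff)]
    return weights


def score(weights, features):
    """Relevance score of one (query, url) feature vector."""
    return sum(wt * f for wt, f in zip(weights, features))


weights = train_pairwise(judgments)
```

Documents are only compared within the same query, which is the part that carries over to real learning-to-rank setups regardless of the model used.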

> and doing it with an acceptable latency

Lots of interesting optimizations are possible here, but the big obvious one is multi-level models: score documents with a cheap model first (FastRank in Yandex lingo) using a subset of the fastest available features, then rescore the top docs with your best, slow, expensive model. You can rescore multiple times at different points in the stack with models of varying complexity: at each index shard, and again after aggregating the results from a subset of the shards or all of them. Also sort the documents in each index shard by some other ML model with query-independent features, to push all the junk to the end of the index, where you'd likely skip it when you run out of time budget for a query.
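To make the cascade concrete, here is a minimal sketch, not anyone's production design. Each shard is pre-sorted by a query-independent `static_rank`, a cheap scorer picks per-shard candidates until the time budget runs out, and a slower scorer reranks the merged candidates. The scoring functions, weights, field names, and toy documents are all assumptions for illustration.

```python
import heapq
import time


def cheap_score(query_terms, doc):
    # Stage 1: a single fast feature, the number of query terms in the title.
    return sum(term in doc["title"] for term in query_terms)


def expensive_score(query_terms, doc):
    # Stage 2: stand-in for a slow model that looks at more features.
    title = cheap_score(query_terms, doc)
    body = sum(doc["body"].count(term) for term in query_terms)
    return 2.0 * title + 0.5 * body + doc["static_rank"]


def build_shard(docs):
    # Query-independent ordering: likely junk ends up at the tail of the shard.
    return sorted(docs, key=lambda d: d["static_rank"], reverse=True)


def search_shard(shard, query_terms, top_k, deadline):
    scored = []
    for doc in shard:
        if time.monotonic() > deadline:
            break  # out of budget: the unscanned tail is the low-quality part
        scored.append((cheap_score(query_terms, doc), doc))
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[0])


def search(shards, query, per_shard_k=2, final_k=3, budget_seconds=0.05):
    query_terms = query.lower().split()
    deadline = time.monotonic() + budget_seconds
    candidates = []
    for shard in shards:
        top = search_shard(shard, query_terms, per_shard_k, deadline)
        candidates.extend(doc for _score, doc in top)
    # Rerank only the merged shortlist with the expensive scorer.
    reranked = sorted(
        candidates,
        key=lambda d: expensive_score(query_terms, d),
        reverse=True,
    )
    return [d["url"] for d in reranked[:final_k]]


shards = [
    build_shard([
        {"url": "a.example/catboost", "title": "catboost ranking guide",
         "body": "catboost ranking ranking tutorial", "static_rank": 0.9},
        {"url": "a.example/spam", "title": "cheap pills",
         "body": "ranking ranking ranking", "static_rank": 0.1},
        {"url": "a.example/cats", "title": "cat pictures",
         "body": "cats", "static_rank": 0.5},
    ]),
    build_shard([
        {"url": "b.example/ltr", "title": "learning ranking models",
         "body": "ranking with gradient boosting", "static_rank": 0.7},
        {"url": "b.example/news", "title": "daily news",
         "body": "news", "static_rank": 0.6},
    ]),
]

results = search(shards, "catboost ranking", budget_seconds=5.0)
```

The expensive scorer only ever sees `per_shard_k` documents per shard, so its cost is bounded no matter how large the shards get.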

> Also, what happened to Google PageRank, is it still relevant today?

Vanilla 1990s PageRank, obviously not, but the idea of such graph-based calculations is still very useful.
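For reference, the vanilla version is short enough to sketch as a power iteration over a link graph. The damping factor of 0.85 is the commonly cited default; the `toy_web` graph and the even redistribution of rank from pages without out-links are assumptions of this sketch, and it says nothing about how any current search engine computes link signals.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                # Each page splits its damped rank equally among its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "home" is linked from every other page.
toy_web = {
    "home": ["docs", "blog"],
    "docs": ["home"],
    "blog": ["home", "docs"],
    "orphan": ["home"],
}
ranks = pagerank(toy_web)
```

Anyone who controls enough pages can add links to inflate a target's score, which is why the plain version is easy to game.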


swyx 1238 days ago

> Vanilla 1990s PageRank, obviously not,

what did we learn about the flaws?


thrwyndx 1238 days ago

It's old and too simple for today's web: everyone knows it and everyone games it. But the idea behind it is still useful; it just needs more tricks, more ML, etc.
