

by sfletcher 2042 days ago

I don't fully understand something about the general tech industry discourse around search and would love to hear if I'm wrong.

Here's my brief and slightly made up history of search engines:

In the beginning of time, search engines took a Boolean query (duck AND pond), used an inverted index to find all the documents containing both words, and returned them in something like descending date order. For queries with big result sets this order wasn't very useful, so search engines began letting users enter more "natural language" queries (duck pond) and sorting documents by how many of their terms overlap with the query. They came up with a bunch of relevance formulas (TF-IDF, BM25) that tried to model that overlap. But it turns out this is hard, because inferring user intent is a really tricky problem, so modern search engines just declare that relevance is whatever users click on. Specifically, they model the probability that you're going to click on a link (or something like it) using a DNN whose features include things like the individual term overlap, the number of users who have clicked on this link, the probability that it's spam, the PageRank, etc. Some search engines, like Google, also include personalized features such as the number of times you have clicked on this particular domain, because, for instance, as a programmer your query (Java) might have a different intent than your grandmother's. This score then gets used to sort the results into a ranked list. This is why search engines (DDG included) collect all this data: it makes the relevance problem tractable at web scale.
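To make the first two stages of that history concrete, here is a rough Python sketch: a Boolean AND lookup over an inverted index, followed by BM25 ranking of a free-text query. Everything in it is an assumption for illustration: the toy corpus, the function names, the whitespace tokenizer, and the parameter defaults (k1=1.2, b=0.75, with the common "log(1 + ...)" IDF variant). It is not how any particular engine implements this.

```python
import math
from collections import Counter, defaultdict

# Toy corpus, made up for illustration.
DOCS = {
    1: "the duck swam across the pond",
    2: "a duck is a bird",
    3: "the pond froze over in winter",
    4: "duck duck goose by the duck pond",
}


def tokenize(text):
    # Naive whitespace tokenizer; real engines do far more here.
    return text.lower().split()


def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def boolean_and(index, terms):
    """Ids of documents that contain every term (duck AND pond)."""
    postings = [set(index.get(term, {})) for term in terms]
    return set.intersection(*postings) if postings else set()


def bm25(index, docs, query, k1=1.2, b=0.75):
    """Rank documents matching any query term, best score first."""
    n = len(docs)
    lengths = {doc_id: len(tokenize(text)) for doc_id, text in docs.items()}
    avg_len = sum(lengths.values()) / n
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        df = len(postings)
        if df == 0:
            continue  # term appears nowhere in the corpus
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        for doc_id, tf in postings.items():
            # Length normalization: long documents get less credit per hit.
            norm = tf + k1 * (1 - b + b * lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


INDEX = build_index(DOCS)
print(boolean_and(INDEX, ["duck", "pond"]))
print(bm25(INDEX, DOCS, "duck pond"))
```

The Boolean query only says which documents match (here, documents 1 and 4) and nothing about their order, while the BM25 query returns every document that shares at least one term, sorted by score. Neither step needs to know anything about the user, which is the contrast with the click-based models described above.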

Maybe it's just my perspective, but I really don't understand why OP would want to build an index: it's hard, boring, expensive, and doesn't violate data privacy. And I don't think people grasp that, at least to some extent, data privacy and relevance are in direct conflict?