This definitely is a key challenge. A lot of bias can creep in via picking different ways to judge relevance. Even something as seemingly simple as a click does not necessarily indicate success. Take clickbait as an example: that's literally tricking people into believing something is more relevant than it really is. People click, but are they happy about it? And how would you measure that?

As the article states, human-evaluated rankings are the best. But that is of course a relatively expensive process. And getting any group of people to do things in a consistent, systematic way is a challenge in itself. Is the group of people you pick to do this representative of your user base? Is what you let them evaluate representative of what your user base actually does? And what is the shelf life of these evaluations? Even something as simple as cultural and racial biases can skew results quite a bit unintentionally, and more than a few big-name companies have fallen into this particular trap.
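To make the human evaluation part concrete: a common way to turn graded judgments into a single number is NDCG. The snippet below is only a minimal sketch, assuming judges grade each result from 0 (irrelevant) to 3 (perfect); the function names and the grading scale are my own illustration, not something from the article.

```python
import math

def dcg(gains, k):
    """Discounted cumulative gain over the top k results.

    gains: human relevance grades in the order the engine returned them.
    """
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg(gains, k):
    """DCG normalized by the DCG of the ideal (best possible) ordering."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0

# Hypothetical grades for one query, as assigned by human judges.
perfect_order = [3, 2, 1, 0]
reversed_order = [0, 1, 2, 3]

print(ndcg(perfect_order, 4))
print(ndcg(reversed_order, 4))
```

Note that the number is only as good as the judgments going in. Every question above about who judges, what they judge, and how long the grades stay valid still applies.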

Relevance is inherently subjective. What's relevant for you might not be the same as what is relevant for me, because we probably don't share the same context (goals, intentions, preferences, circumstances, locale, environment, etc.). Many ML-based search solutions are effectively one-size-fits-all solutions that end up locking you into recommendation bubbles. I usually jokingly call this the "more of the FFing same algorithm". A lot of ML devolves into that.

I've been involved with a few machine-learning-based ranking projects over the years. To be clear, I'm not an ML expert and instead usually work on non-ML-based search projects (mostly Elasticsearch in recent years). With Elasticsearch, the Learning to Rank plugin is one of several ways you can leverage ML to improve ranking. It works best in very stable, well-understood domains where the feature set and data set are relatively static and where there is an abundance of user data to work with. The few times I've seen rigorous and effective A/B testing were on teams that had this.

This particular form of ML is of course hardly state of the art at this point. It involves manual feature extraction (i.e. no deep learning here) and relatively simplistic algorithms to basically tune things like boost factors and other parameters in queries. Variations of logistic regression, basically. It's really expensive to do and more expensive to do well. Your starting point is basically a manually crafted query that already mostly does the right things.
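For anyone who hasn't seen this style of ML, here is a rough sketch of the idea: take a couple of hand-extracted features per query/document pair, fit a logistic regression against relevant/not-relevant labels, and read the learned weights as relative boost factors. To be clear, this is not the Learning to Rank plugin's API. The feature names, the toy data, and the `fit_logistic` helper are all made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Plain batch gradient descent for logistic regression.

    rows: list of feature vectors, labels: 0/1 relevance judgments.
    Returns (weights, bias).
    """
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            # Prediction error for this example drives the gradient.
            err = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

# Hypothetical hand-extracted features: (title_match_score, body_match_score).
features = [
    (1.0, 0.2), (0.9, 0.8), (0.8, 0.1), (0.7, 0.6),  # judged relevant
    (0.2, 0.9), (0.1, 0.7), (0.3, 0.3), (0.1, 0.1),  # judged not relevant
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

weights, bias = fit_logistic(features, labels)
print("title weight:", weights[0], "body weight:", weights[1])
```

In this toy data the title match is what separates relevant from irrelevant results, so the title weight comes out much larger than the body weight. That is the kind of signal you would then translate into a boost on the title field. The expensive part is everything around this: collecting labels, extracting features reliably, and keeping both current.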

And even so, in all these teams I noticed a pattern of things hitting a local optimum, where obvious low-hanging fruit in query improvements became hard to pick simply because of the overhead of "proving" it did not negatively impact rankings. Teams like this become change resistant. I've seen teams insist their ranking was awesome when users and product owners clearly had different ideas about that (i.e. they were raising valid complaints).

The paradox here is that most good changes initially degrade the experience until you dial them in over subsequent releases. If you obsessively avoid that kind of disruptive change, your system stops improving altogether. If you have people obsessing over deviations of less than a percent in the metrics they are tracking, those changes never happen and teams get stuck chasing their own tails. I'm not kidding, I've been in meetings where people were debating changes of a fraction of a percent in metrics. Analysis paralysis is a thing with ML teams. They end up codifying their own biases and then are stuck with them until somebody shakes things up a little.
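One sanity check that helps cut those debates short is asking whether a fraction-of-a-percent difference is even distinguishable from noise. Below is a minimal sketch using a standard two-proportion z-test on click-through rates. The traffic numbers are invented, and click-through is just a stand-in metric with all the caveats about clicks mentioned above.

```python
import math

def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    """Two-sided z-test for a difference between two click-through rates.

    Returns (z, p_value). Assumes samples are large enough for the
    normal approximation to hold.
    """
    p_a = clicks_a / n_a
    p_b = clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical A/B test: 30.0% vs 30.3% click-through.
small = two_proportion_z(3_000, 10_000, 3_030, 10_000)
large = two_proportion_z(300_000, 1_000_000, 303_000, 1_000_000)

print("10k searches per arm:", small)
print("1M searches per arm:", large)
```

The same 0.3 point gap is well within noise at ten thousand searches per arm and only becomes statistically detectable at around a million. Even then, detectable is not the same as meaningful to users, which is sort of the point.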