HN Mirror

by _gwlb 2445 days ago

I worked at a large services marketplace, helping data scientists get their models into production as A/B experiments. We had an interesting and related challenge in our search ranking algorithms: we wanted to rank-order results by the predicted lifetime value (LTV) of establishing a relationship between the searcher and each potential service provider. In our case, a 1% increase in LTV from one of these experiments would be... big. Really big.

Improving performance of these ranking models was notoriously difficult. 50% of the experiments we'd run would show no statistically significant change, or would even decrease performance. Another 40% or so would improve one funnel KPI, but decrease another, leading to no net improvement in $$. Only 10% or so of experiments would actually show a marginal improvement to cohort LTV.

I'm not sure how much of this is actually "there's very little marginal value to be gained here" versus a lack of rigor and the absence of a cohesive approach to modeling. The data scientists were very good at what they did, but ownership of models frequently changed hands, and documentation and reporting about what experiments had previously been tried was almost non-existent.

All that to say, productizing ML/AI is very time- and resource-intensive, and it's not always clear why something did or didn't work. It also requires a lot of supporting infrastructure and a data platform whose cost most startups would balk at.

1 comment

JimmyRuska 2445 days ago

If you have historical data to validate against, you can set up a leaderboard for models run against older data, and always hold part of the data out so it stays unseen until testing.

https://gluebenchmark.com/leaderboard/

This encourages a simple first version and incremental complexity, rather than ending up with something very complex 6 months in and never having an easy baseline to compare to. A simple baseline can also spawn several creative methods of improvement to research.
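A rough sketch of that leaderboard idea, using only the Python standard library. The synthetic data, the two toy models, and every function name here are made up for illustration; a real setup would plug in historical data and actual models, and would probably split by time rather than at random.

```python
import random


def split_holdout(rows, holdout_fraction=0.2, seed=0):
    """Shuffle once with a fixed seed and set aside a held-out slice."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]


def mean_absolute_error(model, rows):
    """Average absolute gap between prediction and target."""
    return sum(abs(model(x) - y) for x, y in rows) / len(rows)


def fit_mean_baseline(train):
    """Simplest possible baseline: always predict the training mean."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def fit_linear(train):
    """One-feature least-squares line, as a first step up from the baseline."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in train)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = cov_xy / var_x if var_x else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def leaderboard(fitters, train, holdout):
    """Fit each entry on train only, score on the holdout, best (lowest) first."""
    scores = [
        (name, mean_absolute_error(fit(train), holdout))
        for name, fit in fitters.items()
    ]
    return sorted(scores, key=lambda entry: entry[1])


# Synthetic stand-in for historical data: y is roughly 3x plus noise.
data_rng = random.Random(42)
rows = [(x, 3.0 * x + data_rng.gauss(0, 1)) for x in range(100)]

train, holdout = split_holdout(rows)
board = leaderboard(
    {"mean_baseline": fit_mean_baseline, "linear": fit_linear},
    train,
    holdout,
)
for name, score in board:
    print(f"{name}: holdout MAE = {score:.3f}")
```

Every new model gets added to the same dictionary and scored on the same holdout, so there is always a baseline row to compare against.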

The other point is that the models should also be run against simple cases that are easy to understand and easy to confirm. This way there's always a human QA component available to make sure results are sensible.
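One way this could look in practice is a small table of hand-written cases where a person can say what the right ordering is without running anything. The scoring function, the field names, and the cases below are all hypothetical stand-ins, not anyone's real model.

```python
def rank_by_score(candidates, score):
    """Order candidates from highest to lowest score."""
    return sorted(candidates, key=score, reverse=True)


def toy_score(provider):
    # Hypothetical stand-in model: conversion rate times average order value.
    return provider["conversion_rate"] * provider["avg_order_value"]


# Each case: (description, candidates, expected order of names).
# The expected order should be obvious to a human reviewer.
SIMPLE_CASES = [
    (
        "higher conversion wins when order value is equal",
        [
            {"name": "a", "conversion_rate": 0.1, "avg_order_value": 100.0},
            {"name": "b", "conversion_rate": 0.5, "avg_order_value": 100.0},
        ],
        ["b", "a"],
    ),
    (
        "a provider that never converts ranks last",
        [
            {"name": "a", "conversion_rate": 0.0, "avg_order_value": 1000.0},
            {"name": "b", "conversion_rate": 0.2, "avg_order_value": 50.0},
        ],
        ["b", "a"],
    ),
]


def run_sanity_checks(score, cases):
    """Return a list of (description, expected, actual) for every failed case."""
    failures = []
    for description, candidates, expected in cases:
        actual = [c["name"] for c in rank_by_score(candidates, score)]
        if actual != expected:
            failures.append((description, expected, actual))
    return failures


failures = run_sanity_checks(toy_score, SIMPLE_CASES)
if failures:
    for description, expected, actual in failures:
        print(f"FAILED: {description}: expected {expected}, got {actual}")
else:
    print("all sanity checks passed")
```

Because each case has a plain-English description, a non-specialist can review the list and add new cases when a result looks off.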
