by jonathankoren
1822 days ago
I’m pretty skeptical of this. I’ve run a lot of ML-based A/B tests over my career, and I’ve talked to a lot of people who have done the same. The one constant everyone has discovered is that offline evaluation metrics are only somewhat directionally correlated with online metrics. Seriously. A/B tests are kind of a crapshoot. The systems are constantly changing. The online inference data drifts from the historical training data. User behavior changes. I’ve seen models with positive offline metrics perform flat online. I’ve seen models with negative offline metrics perform positively online. There’s just a lot of variance between offline and online performance. Just run the test. Lower the friction for running tests, and just run them. It’s the only way to be sure.
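
If you want to check the "only somewhat directionally correlated" claim against your own experiment history, a minimal Python sketch might look like the one below. The numbers are made up purely for illustration (they are not real results), and `sign_agreement` and `pearson` are hypothetical helper names, not part of any library.

```python
from math import sqrt

def sign_agreement(offline_deltas, online_deltas):
    """Fraction of experiments whose offline and online deltas share a sign.

    A flat result (delta of exactly 0) counts as a disagreement.
    """
    pairs = list(zip(offline_deltas, online_deltas))
    agree = sum(1 for off, on in pairs if off * on > 0)
    return agree / len(pairs)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# HYPOTHETICAL data: percent change vs. control, one entry per past experiment.
offline = [0.8, 1.2, -0.5, 0.3, -0.9, 1.5]   # offline metric delta
online = [0.0, 0.4, 0.6, -0.2, -0.3, 0.1]    # online metric lift

print("sign agreement:", sign_agreement(offline, online))
print("pearson r:", round(pearson(offline, online), 3))
```

A sign agreement near 0.5 means the offline metric predicts the direction of the online result about as well as a coin flip, which is the situation described above.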
|
Any researcher will tell you: this is really hard. It is more than an engineering problem. You need to know not only how to deal with problems, but also which problems may arise and which of them you can actually identify. Most importantly, you need to figure out what you cannot identify.
There are, at least here in academia, only a limited set of people who are really good at this.
Long story short: even if offline analysis is viable, I doubt every team has the right people for it, which potentially makes it not worthwhile.
It is infinitely easier to produce a statistical analysis that looks good but isn’t than one that is good. Statistically speaking, an overwhelming number of useless offline models would be expected ;)
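
One way an analysis can look good without being good is selection among many candidates. The sketch below is a toy simulation under stated assumptions, not a description of anyone's actual pipeline: every "model" is a coin flip with a true accuracy of 0.5, and `best_of_useless` is a hypothetical helper name. It reports the best offline accuracy found among them.

```python
import random

def best_of_useless(n_models, n_eval, seed=0):
    """Best measured accuracy among n_models models that are all pure noise.

    Each simulated model answers every one of the n_eval evaluation examples
    correctly with probability 0.5, so its true accuracy is exactly 0.5.
    """
    rng = random.Random(seed)
    best = 0.0
    for _ in range(n_models):
        correct = sum(rng.random() < 0.5 for _ in range(n_eval))
        best = max(best, correct / n_eval)
    return best

# ASSUMED setup: 200 candidate models, 100 held-out examples each.
print("best offline accuracy among useless models:", best_of_useless(200, 100))
```

The winner typically scores well above 0.5 offline even though no candidate has any real signal, so it would be expected to come out flat in an online test.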