by not_jd_salinger 1826 days ago
> GBDTs are still unbeatable.

You'd be surprised how many times I've replaced a GBDT with logistic regression and had negligible drop off in model performance with a dramatic improvement in both training time as well as debugging and fixing production models.

I've had plenty of cases where a bit of reasonable feature transformation can get a logistic model to outperform a GBDT. Any non-linearity you're picking up with a GBDT can often easily be captured with some very simple feature tweaking.
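To make the "simple feature tweaking" point concrete, here is a minimal sketch, not the commenter's actual pipeline. It assumes a made-up one-feature dataset where the label depends on |x|, and a hand-rolled gradient-descent logistic regression; `fit_logistic`, `accuracy`, and the squared feature are all illustrative names and choices.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(rows, labels, lr=0.2, epochs=5000):
    """Batch gradient descent on the mean log loss; returns [bias, w1, ...]."""
    w = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            err = sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wi - lr * g / n for wi, g in zip(w, grad)]
    return w

def accuracy(w, rows, labels):
    hits = 0
    for x, y in zip(rows, labels):
        p = sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
        hits += int((p >= 0.5) == bool(y))
    return hits / len(rows)

# Toy data (made up): the label depends on |x|, so it is not monotone in x.
xs = [i / 10.0 for i in range(-30, 31)]
ys = [1 if abs(x) > 1.5 else 0 for x in xs]

raw = [[x] for x in xs]              # feature: x only
tweaked = [[x, x * x] for x in xs]   # features: x and x squared

acc_raw = accuracy(fit_logistic(raw, ys), raw, ys)
acc_tweaked = accuracy(fit_logistic(tweaked, ys), tweaked, ys)
print(acc_raw, acc_tweaked)
```

On this toy data a logistic model on x alone can get at most 46 of the 61 points right, since it can only cut the line once. Adding x * x makes the two classes separable by a linear boundary, which is the kind of non-linearity a tree would otherwise have to find with splits.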

My experience has been that GBDTs are only particularly useful in Kaggle contests, where minuscule improvements in an arbitrary metric are valuable and training time and model debugging are completely unimportant.

There are absolutely cases where NNs can go places that logistic regression can't touch (CV and NLP), but I have yet to see a real world production pipeline where GBDT provides enough improvement over Logistic Regression to throw out all of the performance and engineering benefits of linear models.


I strongly agree with this. Not to mention parameter interpretability and, in the case of Bayesian models, uncertainty estimates and convergence diagnostics. Such things are very important when making decisions under uncertainty. Kaggle competitions and empirical benchmarks are very biased samples of model performance in real life.

I feel these two things influence the course of Machine Learning research and communities too much, and this is not good. Most ML researchers and practitioners are barely aware of the latest advances in parametric modelling, which is a shame. Multilevel models allow you to model response variables with explicit dependence structures. This is done through random (sometimes hierarchical) effects constrained by variance parameters. These parameters regularize the effects themselves and converge really well when fitting factors with high cardinality.

Also, multilevel models are very interesting when it comes to the bias-variance tradeoff. Having more levels in a distribution of random effects actually DECREASES [1] overfitting, which is fascinating.

[1] https://m-clark.github.io/posts/2019-05-14-shrinkage-in-mixe...
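A minimal sketch of the shrinkage idea, under loud assumptions: a random-intercept model where the within-group variance `sigma2` and between-group variance `tau2` are known (a real multilevel fit estimates both), and the plain pooled mean stands in for the grand-mean estimate. The function name `partial_pool` and the data are made up for illustration.

```python
from statistics import mean

def partial_pool(groups, sigma2, tau2):
    """Shrink each group's mean toward the grand mean.

    Assumes KNOWN within-group variance sigma2 and between-group variance
    tau2; this is a simplification of what a mixed model actually fits.
    """
    grand = mean(y for ys in groups.values() for y in ys)
    estimates = {}
    for name, ys in groups.items():
        n = len(ys)
        weight = tau2 / (tau2 + sigma2 / n)  # between 0 and 1, grows with n
        estimates[name] = grand + weight * (mean(ys) - grand)
    return estimates

# Made-up data: one well-observed group and one with a single observation.
groups = {
    "a": [10.0, 12.0, 11.0, 9.0, 13.0, 11.0, 10.0, 12.0],
    "b": [20.0],
}
estimates = partial_pool(groups, sigma2=4.0, tau2=4.0)
print(estimates)
```

Here group b, with a single observation of 20, is pulled halfway to the grand mean of 12 (to 16), while group a, with eight observations, barely moves from its raw mean of 11. That is the regularization the variance parameters buy you: sparse levels borrow strength from the rest instead of being fit on their own noise.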

While I agree and it is surprising that multi-level/hierarchical modeling is rarely applied in industry (I used them extensively in academia and industry), dealing with hundreds or thousands of random effects in large data sets, especially in non-linear models, is a computational nightmare. And the benefits may not warrant those nightmares.
Finally multi-level/hierarchical modeling is starting to permeate industry thanks to Stan and company.

I use hierarchical modeling regularly to help build Zapier. So do other companies like Generable: https://www.generable.com/

I suspect hierarchical models will become the next “new” hot data structure in software engineering due to their ability to compact logic. https://twitter.com/statwonk/status/1363104221747421184?s=21

I don't know about permeating the industry. I know for example that the model that Airbnb used 3 years ago (things may have changed in the meantime) to forecast occupancy was a random-effects model maintained by a single person in Europe. I don't know about the penetrance of Generable and companies providing similar probabilistic modeling solutions, although I hope they succeed.

When I was working for one of the FAANGs, I was the only one using random effects models (that I know of), in particular non-linear random effects models with ~ hundreds of random effects. I was using a language/tool faster than Stan (fitting the same model with Stan would have taken hours, or more likely days), but making the models converge was always challenging. In addition, since most of my colleagues had a CS background, were in love with the latest non-interpretable, brute-force algorithm, and were scared of a more statistical approach they made no effort to learn, I faced pushback and skepticism despite the model working very well.

I love random effects models, and I built my technical career on them.

I think one of the main reasons is that there is no good Python library for doing linear mixed effect models. There is no sklearn implementation. There are some libraries that wrap R's lmer (probably using rpy2 or something). The best native Python library I could find is statsmodels, and it has several shortfalls (saving a model to disk consumes hundreds of megabytes, the predict method is useless, it just predicts using the fixed effects, multi-level beyond just 1 group is not even clearly documented, and the syntax sucks if you really do it, never mind actually implementing a predict method using those random effects). I think once someone does a decent sklearn implementation, it might take off. I've been thinking of doing an implementation for sklearn as a side project, but I'm not an ML researcher, just a practitioner, so it might suck :)
I used statsmodels for a while ... it's definitely possible to predict arbitrary inputs, it's just a pain to fiddle in the right inputs ...
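For what the "fiddling" can look like, here is a rough sketch for the simplest case only: one grouping factor with a random intercept. It assumes numpy, pandas, and statsmodels are installed; the data is simulated, the helper `predict_with_random_intercept` is a made-up name and not part of statsmodels, and falling back to the fixed-effects prediction for unseen groups is my assumption about a reasonable default.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: 8 groups, each with its own intercept offset.
rng = np.random.default_rng(0)
n_groups, n_per = 8, 30
g = np.repeat(np.arange(n_groups), n_per)
offsets = rng.normal(0.0, 2.0, n_groups)
x = rng.normal(size=g.size)
y = 1.0 + 0.5 * x + offsets[g] + rng.normal(0.0, 0.5, g.size)
df = pd.DataFrame({"y": y, "x": x, "g": g})

res = smf.mixedlm("y ~ x", df, groups=df["g"]).fit()

def predict_with_random_intercept(res, new, group_col="g"):
    """Fixed-effects prediction plus the fitted random intercept per group.

    Hypothetical helper: unseen groups get a random effect of 0, i.e. the
    plain fixed-effects prediction.
    """
    fixed = res.predict(new)  # statsmodels uses the fixed effects only here
    # res.random_effects maps group label -> Series of that group's effects;
    # with a random intercept only, each Series has a single entry.
    effects = res.random_effects
    re = new[group_col].map(
        lambda k: effects[k].iloc[0] if k in effects else 0.0
    )
    return fixed + re

new = pd.DataFrame({"x": [0.0, 0.0], "g": [0, 99]})  # group 99 was never seen
print(predict_with_random_intercept(res, new))
```

Random slopes or more than one grouping level need the corresponding design columns multiplied in as well, which is where it gets painful.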
>You'd be surprised how many times I've replaced a GBDT with logistic regression and had negligible drop off in model performance with a dramatic improvement in both training time as well as debugging and fixing production models.

Not only reduced training time, but also less data needed for training. Which is particularly important if training on time-series data for something that changes over time, as older data is less useful.

> I have yet to see a real world production pipeline where GBDT provides enough improvement over Logistic Regression

Not my field at all, so "I know nooothing".

Are GBDTs very different from "plain" binary decision trees? I've seen the latter a lot in the context of particle experiments[1][2][3].

[1]: https://arxiv.org/abs/physics/0408124

[2]: http://cds.cern.ch/record/2289251/

[3]: https://arxiv.org/abs/2002.02534

Very simply: plain decision trees usually overfit to training data (and, therefore, perform very badly out of sample). So the important part isn't the tree but the boosting. How you go from an ensemble of weak learners to something that works.

And this boosting generalises to any learner. You can apply it to regression too. Again, the boosting part is really the key. The innovation isn't a new technique either, it is just the aggressive application of computing power to these problems.

They are the same concept under the hood, but a GBDT is an ensemble model using a number of trees in tandem that are grown to improve the performance of the overall model.
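A bare-bones sketch of what "boosting weak learners" means, using regression with squared error since boosting applies there too. Everything here is illustrative: the one-split stump, the function names `fit_stump` and `boost`, the learning rate, and the toy sine data are my assumptions, and real GBDT libraries add deeper trees, regularization, and much faster split finding.

```python
import math

def fit_stump(xs, residuals):
    """Best single-split regression stump on one feature (squared error)."""
    best = None
    # Drop the largest value so the right side of a split is never empty.
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, rounds=50, learning_rate=0.3):
    """Each round fits a stump to the current residuals and adds it in."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(rounds):
        # For squared loss the negative gradient is just the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data (made up): a smooth curve that one split cannot follow.
xs = [i / 10.0 for i in range(40)]
ys = [math.sin(x) for x in xs]

single = boost(xs, ys, rounds=1, learning_rate=1.0)  # one stump on its own
boosted = boost(xs, ys, rounds=50, learning_rate=0.3)
mse_single = mse(single, xs, ys)
mse_boosted = mse(boosted, xs, ys)
print(mse_single, mse_boosted)
```

Each stump is a poor model by itself; the ensemble gets its accuracy from every new stump being fit to whatever the previous ones still get wrong. Note this only reports training error, so it shows the mechanism and says nothing about out-of-sample behaviour.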
Uhm how do you deal with imbalanced data? Like I mean 99:1 or something? I’m always worried about feature engineering - in the right hands it’s great but I’d posit that majority of DSes out there do not have said hands. Much rather take a random forest with no manipulation and shittier (and hopefully less biased) results.
What is the size of the datasets? I have a hard time imagining tabular business data large enough to be a problem.
consider the problem of "online advertising"
When you have billions of rows the performance savings can be nice.
One of my projects several years back ran both a LR model and a DNN against the same input data (albeit featurized differently). Accuracy, P&R were roughly the same (minor differences depending on the time horizon), but the LR model took maybe a half hour to train and five minutes to run; the DNN took about 24 hours to train and an hour or two to run.

This wasn't even particularly huge data compared to my other projects. But certainly at that scale, there are huge differences between regression & NNs.