| HN Mirror


by roadside_picnic 340 days ago

"The Bitter Lesson" certainly seems correct when applied to whatever the limit of the current state of the art is, but in practice, when solving day-to-day ML problems outside of FAANG-style companies and cutting-edge research, data is always much more constrained.

I have, multiple times in my career, solved a problem using simple, intelligible models that have empirically outperformed neural models ultimately because there was not enough data for the neural approach to learn anything. As a community we tend to obsess over architecture and then infrastructure, but data is often the real limiting factor.

When I was early in my career I used to always try to apply very general, data-hungry models to all my problems, with very mixed success. As I became more skilled I became a staunch advocate of only using simple models you could understand, with much more successful results (which is what led to this revised opinion). But, at this point in my career, I increasingly see that one's approach to modeling should basically be information theoretic: try to figure out the model with a channel capacity that best matches your information rate.

As a Bayesian, I also think there's a very reasonable explanation for why "The Bitter Lesson" rings true over and over again. In ET Jaynes' writing he often talks about Bayes' Theorem in terms of P(D|H) (i.e. probability of the Data given the Hypothesis, or vice versa), but, especially in the earlier chapters, purposefully adds an X to that equation: P(D|H,X), where X is a stand-in for all of our prior information about the world. Typically we think of prior data as being literal data, but Jaynes points out that our entire world of understanding is also part of our prior context.
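For reference, this is the standard form of Bayes' Theorem with the background information X written out explicitly. Every term is conditioned on X, which is the point: the prior P(H|X) is where "human understanding" enters.

```latex
P(H \mid D, X) = \frac{P(D \mid H, X)\, P(H \mid X)}{P(D \mid X)}
```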

In this view, models that "leverage human understanding" (i.e. are fully intelligible) are essentially throwing out information at the limit. But to my earlier point, if the data falls quite short of that limit, then those intelligible models are adding information in data constrained scenarios. I think the challenge in practical application is figuring out where the threshold is that you need to adopt a more general approach.

Currently I'm very much in love with Gaussian Processes, which, for constrained data environments, offer a powerful combination of both of these methods. You can give the model prior hints about what things should look like through the relative structure of the kernel and its priors (e.g. there should be one roughly annual seasonal component and one roughly weekly seasonal component) but otherwise let the data decide.
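A minimal sketch of the kernel idea, in plain Python so it runs without dependencies. Everything specific here is an assumption for illustration: the synthetic data, the variances, the length scales, and the noise level are made up, and the hyperparameters are fixed by hand rather than given priors and learned as described above. In practice you would use a GP library instead of the hand-rolled Cholesky solve.

```python
import math
import random

def periodic_kernel(t1, t2, period, variance, length_scale=1.0):
    """Standard periodic (exp-sine-squared) covariance between two time points."""
    s = math.sin(math.pi * abs(t1 - t2) / period)
    return variance * math.exp(-2.0 * s * s / length_scale ** 2)

def kernel(t1, t2):
    # Prior hint: one roughly annual plus one roughly weekly component (time in days).
    # Variances and length scales are illustrative guesses, not tuned values.
    return (periodic_kernel(t1, t2, period=365.0, variance=4.0)
            + periodic_kernel(t1, t2, period=7.0, variance=1.0))

def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def cholesky_solve(low, b):
    """Solve (low @ low.T) x = b by forward then backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x

def gp_posterior_mean(train_t, train_y, test_t, noise_var):
    """Zero-mean GP regression: posterior mean at test_t given noisy observations."""
    n = len(train_t)
    k = [[kernel(train_t[i], train_t[j]) + (noise_var if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = cholesky_solve(cholesky(k), train_y)
    return [sum(kernel(t, train_t[i]) * alpha[i] for i in range(n)) for t in test_t]

def truth(t):
    # Synthetic signal (an assumption): annual plus weekly seasonality.
    return 2.0 * math.sin(2 * math.pi * t / 365.0) + math.sin(2 * math.pi * t / 7.0)

def rmse(estimates, targets):
    return math.sqrt(sum((e - y) ** 2 for e, y in zip(estimates, targets)) / len(targets))

# Only 120 daily observations: far less than one full annual cycle.
rng = random.Random(0)
train_t = [float(d) for d in range(120)]
train_y = [truth(t) + rng.gauss(0.0, 0.1) for t in train_t]
test_t = [float(d) for d in range(120, 127)]  # forecast the following week

pred = gp_posterior_mean(train_t, train_y, test_t, noise_var=0.01)

targets = [truth(t) for t in test_t]
mean_y = sum(train_y) / len(train_y)
gp_rmse = rmse(pred, targets)
baseline_rmse = rmse([mean_y] * len(test_t), targets)  # predict the training mean
print(f"GP RMSE: {gp_rmse:.3f}  mean-baseline RMSE: {baseline_rmse:.3f}")
```

The only "human understanding" baked in is the sum of two periodic kernels; the shape and phase of each seasonal component still come from the data.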