by frgtpsswrdlame 843 days ago
I still don't get the impetus or desire to make NNs work better for tabular data. Regression works pretty well and is easy to interpret/diagnose/work with. GBMs work really well (given a few considerations) and are trickier to work with but nothing crazy. When I see all the fancy hijinks people get up to when applying NNs to audio/text/pictures I think it's really cool but also not something I'd want to have to do if I didn't absolutely need to when working with data out of a relational db. And anyways, how much of a benefit could it actually bring? GBMs are already capable of fitting and dramatically overfitting most datasets.
4 comments

The paper offers a reason why NNs working for tabular data would be good:

>Creating tabular-specific deep learning architectures is a very active area of research (see section 2) given that tree-based models are not differentiable, and thus cannot be easily composed and jointly trained with other deep learning blocks.

Here is a second reason, from the paper:

>Impressed by the superiority of tree-based models on tabular data, we strive to understand which inductive biases make them well-suited for these data.

which is a great reason, because understanding the inductive biases of different learning/regression techniques gets us closer to a more general understanding of how to encode inductive biases in a generic learning algorithm.

My hypothesis is decision trees are more robust to nonstationary distributions. If the variance and means of the features shift dramatically, the model isn't going to blow up, because it's not additive.

In the domains where NNs work well (image processing and language), you're dealing with a predictable and stable distribution of values. Elephants might look a bit different in the train and test set, but you're not randomly getting 100x the variance of the input data. The decision tree just isn't going to care as much, because splits around the mean will lead to the same outcome.
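A toy sketch of that hypothesis, not anything from the paper: the split threshold and the linear weight below are hand-set assumptions rather than fitted values, and the "shift" is just the same samples scaled by 100. It only shows that a split near the mean keeps giving the same answers while an additive model's output grows with the input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical single feature: "training" data centred at 0 with unit variance.
x_train = rng.normal(0.0, 1.0, size=1000)

# "Test" data: same mean, but 100x the standard deviation.
x_test = 100.0 * x_train


def stump_predict(x, threshold=0.0, left=-1.0, right=1.0):
    """Depth-1 tree: the output depends only on which side of the split x falls."""
    return np.where(x <= threshold, left, right)


def linear_predict(x, weight=0.5, bias=0.0):
    """Additive model: the output scales directly with the input."""
    return weight * x + bias


# The stump assigns every sample to the same leaf before and after the shift.
stump_same = np.array_equal(stump_predict(x_train), stump_predict(x_test))

# The linear model's output range grows with the input scale.
linear_ratio = (np.abs(linear_predict(x_test)).max()
                / np.abs(linear_predict(x_train)).max())

print("stump predictions unchanged:", stump_same)
print("linear output range grew by a factor of:", linear_ratio)
```

A shift in the mean would move samples across the split, so this only illustrates the variance half of the argument.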

Another hypothesis is that zooming into bivariate relationships is more important in tabular data. Neural nets are better at local and global context, but because of their additive nature they struggle if all that matters is the relationship between two columns of data. Large networks can figure it out due to model capacity, but then you'll run into overfitting.

In case anyone's sufficiently motivated (no promises, but I might test it out eventually), a couple deep architectures that might address those concerns are:

1. Something like a deep support vector machine. Instead of (linear) -> (any activation), you want to create a bunch of features that look like testing the vector against a splitting hyperplane. One option is (bias) -> (matmul) -> (1-bit sigmoid). Applying a bias term _for each row_ lets you choose the branch location. The matmul's result will then be positive or negative at each output feature, depending on which side of the hyperplane (the one normal to the vector in the corresponding row) you happen to fall on. Then just bring that down to -1 or 1 so you can't sneak much nonstationary drift variance into the output (perhaps train with a normal sigmoid annealed to behave more like this one, plus a suitable regularizing term to keep the network from sneaking in values near 0 to thwart your annealing).

2. Use an attention-like mechanism, but across features (this would likely require an additional tensor channel, so that each "feature" carries information in a high enough dimensional space for this to do something meaningful). You apply the inductive bias that sparse feature interactions are important and need to be discovered.

Those two ideas also compose easily.
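For idea 1, a forward-pass-only sketch in NumPy might look like the following. The function name, shapes, and the `temperature` parameter are my own assumptions, and `tanh` stands in for the "normal sigmoid" because its range already matches -1..1. The regularizer is left out.

```python
import numpy as np


def hyperplane_layer(x, bias, weight, temperature=None):
    """(bias) -> (matmul) -> (sign-like squashing).

    x:      [batch, in_features]
    bias:   [out_features, in_features], one offset per row of `weight`
    weight: [out_features, in_features], each row is a hyperplane normal
    """
    # Shift the input separately for every output unit: [batch, out, in].
    shifted = x[:, None, :] + bias[None, :, :]
    # Project each shifted copy onto its own normal vector: [batch, out].
    scores = np.einsum('boi,oi->bo', shifted, weight)
    if temperature is None:
        # Hard version: only the side of the hyperplane survives.
        return np.where(scores >= 0.0, 1.0, -1.0)
    # Soft version for training: approaches the hard one as temperature -> 0.
    return np.tanh(scores / temperature)


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
bias = rng.normal(size=(5, 3))
weight = rng.normal(size=(5, 3))

hard = hyperplane_layer(x, bias, weight)                    # values in {-1, 1}
soft = hyperplane_layer(x, bias, weight, temperature=1e-3)  # annealed stand-in
hard_big = hyperplane_layer(100.0 * x, bias, weight)        # still in {-1, 1}
```

The hard version has zero gradient almost everywhere, which is why you'd train with the soft one and anneal `temperature` downward.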

> this would likely require an additional tensor channel, so that each "feature" carries information in a high enough dimensional space

Suppose input data is [batch_size, num_features]. Then you do x.unsqueeze(-1) giving you [batch_size, num_features, 1]. Then what?

You probably want something equivalent to (however you make it fast in your chosen framework):

einsum('bf,fc->bfc', batched_inputs, channel_embedding)

Then carry that info through the network and project it down at the end. It's roughly equivalent to the token embedding step in an LLM.
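Spelled out in NumPy as a rough sketch: the sizes are arbitrary, the random arrays stand in for learned parameters, and the attention step is a bare single head with no query/key/value projections, so treat it as an illustration of the shapes rather than a recommended architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, num_features, channels = 8, 5, 16  # arbitrary illustrative sizes

batched_inputs = rng.normal(size=(batch_size, num_features))   # [b, f]
channel_embedding = rng.normal(size=(num_features, channels))  # [f, c], learned in practice

# Each scalar feature scales its own embedding vector: [b, f] x [f, c] -> [b, f, c].
tokens = np.einsum('bf,fc->bfc', batched_inputs, channel_embedding)

# The same thing written with broadcasting.
tokens_broadcast = batched_inputs[:, :, None] * channel_embedding[None, :, :]

# Attention across the feature axis: every feature attends to every other feature.
scores = np.einsum('bfc,bgc->bfg', tokens, tokens) / np.sqrt(channels)  # [b, f, f]
scores = scores - scores.max(axis=-1, keepdims=True)  # for numerical stability
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over features
mixed = np.einsum('bfg,bgc->bfc', weights, tokens)       # [b, f, c]

# Project back down at the end; `head` is a random stand-in for a learned layer.
head = rng.normal(size=(num_features * channels,))
output = mixed.reshape(batch_size, -1) @ head            # [b]
```

So the unsqueeze alone isn't enough: it gives each feature a channel axis of size 1, while the embedding gives it `channels` dimensions to work with.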

When you need the best possible model, full stop.

E.g. finance

In a sufficiently competitive space, good enough doesn't cut it.

There is no such thing as "best possible model, full stop". Models are always context dependent, have implicit or explicit assumptions about what is signal and what is noise, have different performance characteristics in training or execution. Choosing the "best" model for your task is a form of hyperparameter optimization in itself.
I can’t upvote this enough. Whether in life or with models, some people really do believe in the myth of absolute meritocracy.
Do you know of any shop that is running deep learning profitably?
Plenty of places use DL models, even if it's just a component of their stack. I would guess that gradient-boosted trees are more common in applications, though.
Do you know what kinds of strategies it's being used in?
Still mostly NLP and image stuff. Most actual data in the wild is tabular, for which GBTs are usually some combination of better and easier. In some circumstances, NNs can still work well on tabular problems with the right feature engineering or model stacking.

They are also more attractive for streaming data. Tree-based models can't learn incrementally. They have to be retrained from scratch each time.

ML is very good at figuring out stuff like: every day at 22:00 this asset goes up if this other asset is not at a daily maximum and the volatility of the market is low.

You might call this overfitting/noise/.... but if you do it carefully it's profitable.

Real-time parsing of incoming news events and live scanning of internet news sites - coupled with sentiment analysis. Latency is an interesting challenge in that space.
Multiple parts of the iPhone stack run DL models locally on your phone. They even added hardware acceleration to the camera because most of the picture quality upgrades are software rather than hardware.
These models usually have poorer fit though
At this point I wish every junior DS could read this paper and not come in to every problem with the new bright idea that they’re going to beat XGBoost with their DL architecture. Free promotion if they never say the words “latent subspace”
One of those juniors is going to do it once!
because smooth is better than jagged :)