by dawnofdusk 843 days ago
The paper offers one reason why getting NNs to work on tabular data would be worthwhile:

>Creating tabular-specific deep learning architectures is a very active area of research (see section 2) given that tree-based models are not differentiable, and thus cannot be easily composed and jointly trained with other deep learning blocks.

Here is a second reason, also from the paper:

>Impressed by the superiority of tree-based models on tabular data, we strive to understand which inductive biases make them well-suited for these data.

which is a great reason, because understanding the inductive biases of different learning/regression techniques gets us closer to a more general understanding of how to encode inductive biases in a generic learning algorithm.


My hypothesis is that decision trees are more robust to nonstationary distributions. If the means and variances of the features shift dramatically, a tree isn't going to blow up, because its output is not an additive function of the inputs.

In the domains where NNs work well (image processing and language), you're dealing with a predictable and stable distribution of values. Elephants might look a bit different in the train and test set, but you're not randomly getting 100x the variance of the input data. The decision tree just isn't going to care as much, because splits around the mean will lead to the same outcome.
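To make that concrete, here is a toy sketch (not evidence for the hypothesis, just an illustration of the mechanism). The stump, the linear model, and all the numbers are made up and fixed by hand; nothing is trained. It assumes the split sits at the center of the data and the shift is a pure rescaling around that center.

```python
# Toy illustration only: both "models" are fixed by hand, nothing is trained.

def stump(x, threshold=0.0, left=-1.0, right=1.0):
    """One-split tree: the output is always one of the two leaf values."""
    return left if x <= threshold else right

def linear(x, weight=2.0, bias=0.5):
    """Additive model: the output grows with the size of the input."""
    return weight * x + bias

train_like = [-1.5, -0.5, 0.25, 1.0, 2.0]   # roughly centered on the split
shifted = [10.0 * x for x in train_like]     # 10x the spread = 100x the variance

# The stump only looks at which side of the threshold x falls on.
stump_unchanged = [stump(x) for x in train_like] == [stump(x) for x in shifted]

# The linear model's largest output scales with the input spread.
linear_before = max(abs(linear(x)) for x in train_like)
linear_after = max(abs(linear(x)) for x in shifted)

print(stump_unchanged, linear_before, linear_after)
```

If the shift also moves the mean away from the split, the stump's predictions can change, but they still stay within the range of its leaf values.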

Another hypothesis is that zooming into bivariate relationships is more important in tabular data. Neural nets are better at local and global context, but because of their additive nature they struggle if all that matters is the relationship between two columns of data. Large networks can figure it out due to model capacity, but then you'll run into overfitting.

In case anyone's sufficiently motivated (no promises, but I might test it out eventually), a couple deep architectures that might address those concerns are:

1. Something like a deep support vector machine. Instead of (linear) -> (any activation), you want to create a bunch of features that look like testing the vector against a splitting hyperplane. One option is (bias) -> (matmul) -> (1-bit sigmoid). Applying a bias term _for each row_ lets you choose the branch location. The matmul's result will then be positive or negative at each output feature, depending on which side you fall on of the hyperplane normal to the vector described by the corresponding row. Then just bring that down to -1 or 1 so you can't sneak much nonstationary drift variance into the output (perhaps train with a normal sigmoid annealed to behave more like this one, and a suitable regularizing term to keep the network from sneaking in values near 0 to thwart your annealing).

2. Use an attention-like mechanism, but across features (this would likely require an additional tensor channel, so that each "feature" carries information in a high enough dimensional space for this to do something meaningful). You apply the inductive bias that sparse feature interactions are important and need to be discovered.

Those two ideas also compose easily.
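A rough NumPy sketch of the first idea, forward pass only. The names `hyperplane_layer`, `normals`, `offsets`, and `temperature` are mine, and using tanh as the annealed stand-in for the "1-bit sigmoid" is an assumption; the regularizing term is left out.

```python
import numpy as np

def hyperplane_layer(x, normals, offsets, temperature=None):
    """x: [batch, features]; normals and offsets: [units, features].

    Each unit tests which side of a hyperplane x falls on. The hyperplane
    passes through offsets[u] and is normal to normals[u].
    temperature=None gives the hard -1/+1 output; a positive float gives a
    tanh surrogate that approaches the hard version as it grows.
    """
    # (bias): shift the input separately per unit -> [batch, units, features]
    shifted = x[:, None, :] - offsets[None, :, :]
    # (matmul): signed projection onto each unit's normal -> [batch, units]
    side = np.einsum('buf,uf->bu', shifted, normals)
    if temperature is None:
        return np.where(side >= 0.0, 1.0, -1.0)  # hard "1-bit" output
    return np.tanh(temperature * side)           # soft version for training

x = np.array([[0.0, 0.0], [3.0, 1.0], [300.0, 100.0]])  # last row is 100x the second
normals = np.array([[1.0, 0.0], [0.0, 1.0]])
offsets = np.array([[1.0, 0.0], [0.0, 0.5]])

hard = hyperplane_layer(x, normals, offsets)
soft = hyperplane_layer(x, normals, offsets, temperature=50.0)
```

The second and third input rows land on the same side of both hyperplanes, so they get the same output even though one is 100x larger. The hard version has no useful gradient, which is why you'd train on the soft one and raise `temperature` over time.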

> this would likely require an additional tensor channel, so that each "feature" carries information in a high enough dimensional space

Suppose input data is [batch_size, num_features]. Then you do x.unsqueeze(-1), giving you [batch_size, num_features, 1]. Then what?

You probably want something equivalent to (however you make it fast in your chosen framework):

einsum('bf,fc->bfc', batched_inputs, channel_embedding)

Then carry that info through the network and project it down at the end. It's roughly equivalent to the token embedding step in an LLM.
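Spelled out in NumPy, as a sketch rather than a recommended architecture: the embedding step, one single-head attention pass across the feature axis, and a projection down at the end. All the weight matrices here (`w_q`, `w_k`, `w_v`, `w_out`) are hypothetical parameters, randomly initialized, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, num_features, channels = 4, 3, 8

batched_inputs = rng.normal(size=(batch_size, num_features))
# Hypothetical learned parameters, randomly initialized for the sketch.
channel_embedding = rng.normal(size=(num_features, channels))
w_q = rng.normal(size=(channels, channels))
w_k = rng.normal(size=(channels, channels))
w_v = rng.normal(size=(channels, channels))
w_out = rng.normal(size=(num_features * channels,))

# Each scalar feature scales its own embedding vector: [b, f] -> [b, f, c].
tokens = np.einsum('bf,fc->bfc', batched_inputs, channel_embedding)

# Single-head attention where the "sequence" axis is the feature axis.
q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
scores = np.einsum('bfc,bgc->bfg', q, k) / np.sqrt(channels)
scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)  # softmax over features g
mixed = np.einsum('bfg,bgc->bfc', weights, v)   # [b, f, c]

# Project down at the end: flatten features x channels, one output per row.
prediction = mixed.reshape(batch_size, -1) @ w_out
```

The einsum is the same as broadcasting `batched_inputs[:, :, None] * channel_embedding`, so the unsqueezed [batch_size, num_features, 1] tensor is just the left factor of that product. One caveat with a purely multiplicative embedding: a feature value of exactly 0 maps to the zero vector, so you may also want a per-feature additive term.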