| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by hackerlight 843 days ago

My hypothesis is decision trees are more robust to nonstationary distributions. If the variance and means of the features shift dramatically, the model isn't going to blow up, because it's not additive.

In the domains where NNs work well (image processing and language), you're dealing with a predictable and stable distribution of values. Elephants might look a bit different in the train and test set, but you're not randomly getting 100x the variance of the input data. The decision tree just isn't going to care as much, because splits around the mean will lead to the same outcome.

Another hypothesis is that zooming into bivariable relationships is more important in tabular data. Neural nets are better at local and global context. But they struggle if all that matters is the relationship between two columns of data because of the additive nature. Large networks can figure it out due to model capacity, but then you'll run into overfitting.

1 comments

hansvm 843 days ago

In case anyone's sufficiently motivated (no promises, but I might test it out eventually), a couple deep architectures that might address those concerns are:

1. Something like a deep support vector machine. Instead of (linear) -> (any activation), you want to create a bunch of features that look like testing the vector against a splitting hyperplane. One option is (bias) -> (matmul) -> (1-bit sigmoid). Applying a bias term _for each row_ let's you choose the branch location, the matmul's result will be positive or negative at each output feature depending on which side of the hyperplane normal to the vector described by the corresponding row you happen to fall on. Then just bring that down to -1 or 1 so you can't sneak much nonstationary drift variance into the output (perhaps train with a normal sigmoid annealed to behave more like this one, and a suitable regularizing term to keep the network from sneaking in values near 0 to thwart your annealing).

2. Use an attention-like mechanism, but across features (this would likely require an additional tensor channel, so that each "feature" carries information in a high enough dimensional space for this to do something meaningful). You apply the inductive bias that sparse feature interactions are important and need to be discovered.

Those two ideas also compose easily.

hackerlight 843 days ago

> this would likely require an additional tensor channel, so that each "feature" carries information in a high enough dimensional space

Suppose input data is [batch_size, num_features]. Then you do x.unsqueeze(1) giving you [batch_size, num_features, 1]. Then what?

hansvm 843 days ago

You probably want something equivalent to (however you make it fast in your chosen framework):

einsum('bf,fc->bfc', batched_inputs, channel_embedding)

Then carry that info through the network and project it down at the end. It's roughly equivalent to the token embedding step in an LLM.