

by hansvm 843 days ago

In case anyone's sufficiently motivated (no promises, but I might test it out eventually), a couple of deep architectures that might address those concerns are:

1. Something like a deep support vector machine. Instead of (linear) -> (any activation), you want to create a bunch of features that look like testing the vector against a splitting hyperplane. One option is (bias) -> (matmul) -> (1-bit sigmoid). Applying a bias term _for each row_ lets you choose the branch location; the matmul's result will then be positive or negative at each output feature depending on which side of the hyperplane (the one normal to the vector described by the corresponding row) you happen to fall on. Then just bring that down to -1 or 1 so you can't sneak much nonstationary drift variance into the output (perhaps train with a normal sigmoid annealed to behave more like this one, plus a suitable regularizing term to keep the network from sneaking in values near 0 to thwart your annealing).

2. Use an attention-like mechanism, but across features (this would likely require an additional tensor channel, so that each "feature" carries information in a high enough dimensional space for this to do something meaningful). You apply the inductive bias that sparse feature interactions are important and need to be discovered.

Those two ideas also compose easily.
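
A minimal NumPy sketch of how the first idea could look, forward pass only. Everything named here (`hyperplane_features`, `near_zero_penalty`, the `temperature` knob) is hypothetical, and two things are my assumptions rather than anything spelled out above: reading the per-row bias as an offset added to the input before the dot product with that row, and using tanh (a rescaled sigmoid with range -1 to 1) as the annealed stand-in for the 1-bit step.

```python
import numpy as np

def hyperplane_features(x, weights, offsets, temperature=None):
    """(bias) -> (matmul) -> (1-bit sigmoid), as one layer.

    x:       [batch, n_in]
    weights: [n_out, n_in], each row is the normal vector of one hyperplane
    offsets: [n_out, n_in], one bias vector per row (assumed interpretation)
    """
    # Shift the input separately for every output row: [batch, n_out, n_in].
    shifted = x[:, None, :] + offsets[None, :, :]
    # Signed score against each hyperplane: [batch, n_out].
    z = np.einsum('boi,oi->bo', shifted, weights)
    if temperature is None:
        # Hard 1-bit output: only the side of the hyperplane survives.
        return np.where(z >= 0.0, 1.0, -1.0)
    # Soft surrogate for training; approaches the hard step as temperature -> 0.
    return np.tanh(z / temperature)

def near_zero_penalty(outputs):
    """Hypothetical regularizer: 0 when every output is exactly -1 or 1."""
    return float(np.mean(1.0 - outputs ** 2))

# Two points on either side of the hyperplane x0 = 0.
x = np.array([[1.0, 0.0], [-1.0, 0.0]])
weights = np.array([[1.0, 0.0]])
no_shift = np.zeros((1, 2))
moved = np.array([[-2.0, 0.0]])  # moves the branch location to x0 = 2

hard = hyperplane_features(x, weights, no_shift)
hard_moved = hyperplane_features(x, weights, moved)
soft = hyperplane_features(x, weights, no_shift, temperature=0.01)
```

With no offset the two points land on opposite sides; after moving the branch location both fall on the negative side. The penalty term is just one way to discourage outputs parked near 0 while the temperature is annealed.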


hackerlight 843 days ago

> this would likely require an additional tensor channel, so that each "feature" carries information in a high enough dimensional space

Suppose input data is [batch_size, num_features]. Then you do x.unsqueeze(-1) giving you [batch_size, num_features, 1]. Then what?


hansvm 843 days ago

You probably want something equivalent to (however you make it fast in your chosen framework):

einsum('bf,fc->bfc', batched_inputs, channel_embedding)

Then carry that info through the network and project it down at the end. It's roughly equivalent to the token embedding step in an LLM.
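
To make that concrete, here's a rough NumPy sketch of the whole path: lift each scalar feature into a channel vector with the einsum above, run one attention step across the feature axis, then project down. The helper names, the single-head attention, the random weights, and the flatten-then-linear projection are all placeholder assumptions for illustration, not a tested architecture.

```python
import numpy as np

def lift_features(x, channel_embedding):
    """[batch, features] -> [batch, features, channels].

    Each scalar feature scales its own learned channel vector.
    """
    return np.einsum('bf,fc->bfc', x, channel_embedding)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def feature_attention(h, w_q, w_k, w_v):
    """Single-head self-attention where the 'tokens' are the features."""
    q, k, v = h @ w_q, h @ w_k, h @ w_v           # each [batch, features, d]
    scores = np.einsum('bfd,bgd->bfg', q, k) / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)               # [batch, features, features]
    return np.einsum('bfg,bgd->bfd', attn, v), attn

def project_down(h, w_out):
    """Flatten [batch, features, d] and map to one output per example."""
    return h.reshape(h.shape[0], -1) @ w_out

rng = np.random.default_rng(0)
batch_size, num_features, channels = 4, 3, 8

x = rng.normal(size=(batch_size, num_features))
channel_embedding = rng.normal(size=(num_features, channels))
w_q, w_k, w_v = (rng.normal(size=(channels, channels)) for _ in range(3))
w_out = rng.normal(size=(num_features * channels,))

lifted = lift_features(x, channel_embedding)
mixed, attn = feature_attention(lifted, w_q, w_k, w_v)
prediction = project_down(mixed, w_out)
```

Here `lifted[b, f]` is just `x[b, f] * channel_embedding[f]`, and each row of `attn` says how much one feature draws on every other feature, which is where the sparse-interaction bias would have to show up.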
