by hansvm
843 days ago
In case anyone's sufficiently motivated (no promises, but I might test it out eventually), a couple of deep architectures that might address those concerns are:

1. Something like a deep support vector machine. Instead of (linear) -> (any activation), you want to create a bunch of features that look like testing the vector against a splitting hyperplane. One option is (bias) -> (matmul) -> (1-bit sigmoid). Applying a bias term _for each row_ lets you choose the branch location. The matmul's result is then positive or negative at each output feature, depending on which side you happen to fall on of the hyperplane normal to the vector described by the corresponding row. Then just bring that down to -1 or 1 so you can't sneak much nonstationary drift variance into the output (perhaps train with a normal sigmoid annealed to behave more like this one, plus a suitable regularizing term to keep the network from sneaking in values near 0 to thwart your annealing).

2. Use an attention-like mechanism, but across features (this would likely require an additional tensor channel, so that each "feature" carries information in a high enough dimensional space for this to do something meaningful). You apply the inductive bias that sparse feature interactions are important and need to be discovered.

Those two ideas also compose easily.
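
For concreteness, here is a rough, dependency-free sketch of one way to read idea 1, i.e. (bias) -> (matmul) -> (1-bit sigmoid). Nothing here has been tested on real data. The name `hyperplane_layer`, the `temperature` parameter, and the choice of tanh as the annealed stand-in for the "normal sigmoid" (so outputs stay in [-1, 1]) are all assumptions made for illustration.

```python
import math


def hyperplane_layer(x, weights, biases, temperature=0.0):
    """Test one input vector against a set of splitting hyperplanes.

    x:        list of floats, length num_features
    weights:  [num_outputs][num_features], one hyperplane normal per row
    biases:   [num_outputs][num_features], one shift per row (assumed reading
              of "a bias term for each row")
    temperature: 0 gives the hard -1/+1 output; > 0 gives tanh(z / temperature),
              which approaches the hard output as temperature is annealed to 0.
    """
    out = []
    for w_row, b_row in zip(weights, biases):
        # (bias) -> (matmul): shift the input by this row's bias, then project
        # onto the row. The sign says which side of the hyperplane we are on.
        z = sum(w * (xi + b) for w, xi, b in zip(w_row, x, b_row))
        if temperature > 0:
            out.append(math.tanh(z / temperature))  # soft version for training
        else:
            out.append(1.0 if z >= 0 else -1.0)     # 1-bit output
    return out


x = [1.0, 2.0]
weights = [[1.0, 0.0], [0.0, -1.0]]
biases = [[-3.0, 0.0], [0.0, -5.0]]
print(hyperplane_layer(x, weights, biases))  # [-1.0, 1.0]
```

With `temperature=1.0` the same call returns soft values strictly between -1 and 1, and shrinking the temperature pushes them toward the hard output. The regularizing term that discourages pre-activations near 0 is not shown.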
|
Suppose the input data is [batch_size, num_features]. Then you do x.unsqueeze(-1), giving you [batch_size, num_features, 1]. Then what?
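
One possible continuation, offered only as a guess at what "attention across features" could mean: lift each scalar feature into a small vector (the "additional tensor channel"), then run plain self-attention along the feature axis so each feature can mix in information from the others. The sketch below handles a single example in pure Python. The helper names (`embed_features`, `feature_attention`), the per-feature scale and offset embedding, and the use of identity query/key/value projections are all assumptions, not something stated in the thread.

```python
import math


def embed_features(x, scale, offset):
    """Turn each scalar feature x[i] into a d-dim token: x[i] * scale[i] + offset[i].

    x: [num_features]; scale, offset: [num_features][d] (learned in practice).
    """
    return [
        [xi * s + o for s, o in zip(s_row, o_row)]
        for xi, s_row, o_row in zip(x, scale, offset)
    ]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def feature_attention(tokens):
    """Scaled dot-product self-attention over the feature axis.

    tokens: [num_features][d]. Queries, keys and values are the tokens
    themselves (identity projections) to keep the sketch short.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)  # how much this feature attends to each other one
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) for c in range(d)])
    return out


x = [1.0, 2.0, 3.0]
scale = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
offset = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
tokens = embed_features(x, scale, offset)  # shape [num_features, d] = [3, 2]
mixed = feature_attention(tokens)          # same shape, features now interact
```

In a real model the embedding and the query/key/value projections would be learned and batched as tensor ops. Encouraging the attention weights to be sparse would be a separate design choice that this sketch does not attempt.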