by julius
317 days ago
Less information loss -> Less params? Please correct me if I got this wrong.

The Intro claims: "The dot product itself is a geometrically impoverished measure, primarily capturing alignment while conflating magnitude with direction and often obscuring more complex structural and spatial relationships [10, 11, 4, 61, 17]. Furthermore, the way current activation functions achieve non-linearity can exacerbate this issue. For instance, ReLU (f(x) = max(0, x)) maps all negative pre-activations, which can signify a spectrum of relationships from weak dissimilarity to strong anti-alignment, to a single zero output. This thresholding, while promoting sparsity, means the network treats diverse inputs as uniformly orthogonal or linearly independent for onward signal propagation. Such a coarse-graining of geometric relationships leads to a tangible loss of information regarding the degree and nature of anti-alignment or other negative linear dependencies. This information loss, coupled with the inherent limitations of the dot product, highlights a fundamental challenge."
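
To make the two claims in that quote concrete, here is a minimal sketch in plain Python. It is not code from the paper: the weight vector `w`, the four input vectors, and the helper names are hypothetical values chosen for illustration. It shows that two inputs with different geometry can give the same dot product, and that ReLU sends a weakly negative and a strongly negative pre-activation to the same zero.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: direction only, magnitude divided out."""
    return dot(u, v) / (math.hypot(*u) * math.hypot(*v))

def relu(x):
    """ReLU as written in the quote: f(x) = max(0, x)."""
    return max(0.0, x)

# Hypothetical weight vector and inputs, chosen only for illustration.
w = (1.0, 0.0)

# Claim 1: the dot product conflates magnitude with direction.
aligned_short = (3.0, 0.0)  # same direction as w, norm 3
oblique_long = (3.0, 4.0)   # rotated away from w, norm 5
print(dot(w, aligned_short), dot(w, oblique_long))        # identical pre-activations
print(cosine(w, aligned_short), cosine(w, oblique_long))  # different directions

# Claim 2: ReLU collapses every negative pre-activation to zero.
weakly_opposed = (-0.1, 1.0)    # pre-activation just below zero
strongly_opposed = (-5.0, 0.0)  # pre-activation far below zero
print(relu(dot(w, weakly_opposed)), relu(dot(w, strongly_opposed)))
```

Both inputs in the first pair produce the same pre-activation even though only one of them is aligned with `w`, and both inputs in the second pair come out of ReLU as zero. Whether recovering that information actually translates into fewer parameters is a separate question that this sketch does not address.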