by SPascareli13
95 days ago
If the model is trained to be an interpreter, does that mean the loss should reach 0 for it to be fully trained? Also, if its execution is purely deterministic, you probably don't need non-linearity in the layers, right?

You need non-linearity in self-attention because it encodes feature/embedding similarities and correlations (self-attention is, for example, a form of kernel smoothing) and/or multiplicative interactions. This has nothing to do with determinism or non-determinism: a function can be fully deterministic and still non-linear. Also, LLMs are not really non-deterministic in any serious way; the randomness comes from tweaks and optimizations that are not at all core to the architecture.
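
To make the kernel-smoothing point concrete, here is a minimal sketch in plain Python. It assumes a single attention head with identity query/key/value projections, which is a simplification for illustration and not how any particular model is configured; the names `self_attention` and `scale` are hypothetical helpers. Each output row is a softmax-weighted average of the value vectors, and the check at the end shows the map is deterministic yet not linear.

```python
import math


def softmax(scores):
    """Numerically stable softmax: non-negative weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (an assumption).

    Each output row is a weighted average of all token vectors, where the
    weights come from a softmax over scaled dot-product similarities. That is
    the kernel-smoothing view: similar tokens contribute more to the average.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in tokens])
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out


def scale(tokens, c):
    """Multiply every component of every token by the constant c."""
    return [[c * x for x in t] for t in tokens]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(tokens)

# Deterministic: running the same input twice gives the same output.
deterministic = self_attention(tokens) == out

# Linear maps satisfy f(2x) == 2 * f(x); attention does not, because the
# softmax weights themselves depend on the input.
linear = all(
    math.isclose(a, b, abs_tol=1e-9)
    for row_a, row_b in zip(self_attention(scale(tokens, 2.0)), scale(out, 2.0))
    for a, b in zip(row_a, row_b)
)

print(deterministic, linear)
```

Under these assumptions the script prints `True False`: the computation is repeatable, but doubling the input does not double the output, so determinism alone does not let you drop the non-linearity.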