by antoineMoPa
479 days ago
What I don't get about attention is why it would be necessary when a fully connected layer can also "attend" to all of the input. With very small datasets (think 0-500 tokens), I found that attention makes training take longer and gives worse results. I guess the benefits show up with much larger datasets. Note that I'm an AI noob just doing some personal AI projects, so I'm not exactly a reference.
Attention, by contrast, would treat those two occurrences similarly, with the only difference coming from the positional encoding, so you can learn generalized patterns more easily.
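To make that concrete, here's a rough sketch (my own toy setup, not anything from a real library), assuming "those two occurrences" means the same token showing up at two different positions. It uses a single attention head with random weights and no positional encoding, next to a fully connected layer over the flattened sequence. All names (`self_attention`, `fully_connected`, `W_fc`, etc.) are made up for illustration.

```python
import math
import random

def matvec(W, x):
    # Multiply matrix W (list of rows) by vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens, Wq, Wk, Wv):
    # Single-head self-attention WITHOUT positional encoding.
    # The same Wq/Wk/Wv are applied to every token, wherever it sits.
    qs = [matvec(Wq, t) for t in tokens]
    ks = [matvec(Wk, t) for t in tokens]
    vs = [matvec(Wv, t) for t in tokens]
    scale = math.sqrt(len(qs[0]))
    out = []
    for q in qs:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in ks])
        out.append([sum(w * v[j] for w, v in zip(weights, vs))
                    for j in range(len(vs[0]))])
    return out

def fully_connected(tokens, W):
    # Flatten the sequence: each position gets its own slice of W,
    # so a token is handled by different weights depending on where it is.
    flat = [x for t in tokens for x in t]
    return matvec(W, flat)

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d, n = 4, 3  # hypothetical embedding size and sequence length
Wq, Wk, Wv = rand_matrix(d, d), rand_matrix(d, d), rand_matrix(d, d)
W_fc = rand_matrix(d, n * d)

tokens = rand_matrix(n, d)
swapped = [tokens[2], tokens[1], tokens[0]]  # same tokens, first and last swapped

attn_a = self_attention(tokens, Wq, Wk, Wv)
attn_b = self_attention(swapped, Wq, Wk, Wv)
fc_a = fully_connected(tokens, W_fc)
fc_b = fully_connected(swapped, W_fc)

# Attention: the token that moved from position 2 to position 0 gets the same output.
attention_follows_token = all(
    math.isclose(x, y, abs_tol=1e-9) for x, y in zip(attn_a[2], attn_b[0])
)
# Fully connected: the output changes because the weights are tied to positions.
fc_changes = any(abs(x - y) > 1e-6 for x, y in zip(fc_a, fc_b))

print("attention output follows the token:", attention_follows_token)
print("fully connected output changes:", fc_changes)
```

Without positional encoding, attention can't tell the two orderings apart at all, which is why the encoding gets added back in. The point is just that the learned weights are shared across positions, while the fully connected layer has to relearn the same pattern separately for every position it might appear in.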