Haven't watched it yet, but if you have favorite resources for understanding Q & K, please drop them in the comments below. (I've watched the Grant Sanderson/3blue1brown videos, including his excellent talk at TNG Big Tech Day '24, but Q & K still escape me.) Thank you in advance.
Once you recognize this, it's a wonderful reframing of what a transformer is doing under the hood: you're effectively learning a bunch of sophisticated kernels (through the FF part) and then applying kernel smoothing in different ways through the attention layers. It makes you realize that transformers are philosophically much closer to things like Gaussian processes (which are also just a bunch of kernel manipulation).
0. http://bactra.org/notebooks/nn-attention-and-transformers.ht...
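
To make the kernel-smoothing view concrete, here is a rough sketch (mine, not taken from the linked notes) showing that single-query softmax attention gives the same numbers as a Nadaraya-Watson kernel smoother with an exponential dot-product kernel. The function names and the toy vectors are made up for illustration; in a real transformer the queries, keys, and values come from learned linear projections, which this skips entirely.

```python
import math


def softmax_attention(query, keys, values):
    """Scaled dot-product attention for one query, written out by hand."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]


def kernel_smoother(query, keys, values, kernel):
    """Nadaraya-Watson estimate: a kernel-weighted average of the values."""
    k = [kernel(query, key) for key in keys]
    total = sum(k)
    return [sum(ki * v[j] for ki, v in zip(k, values)) / total for j in range(len(values[0]))]


def exp_dot_kernel(x, y):
    """Hypothetical kernel choice: exp(<x, y> / sqrt(d)), matching the softmax scores."""
    return math.exp(sum(a * b for a, b in zip(x, y)) / math.sqrt(len(x)))


# Toy data, hand-picked for illustration (not learned projections).
query = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

attn = softmax_attention(query, keys, values)
smooth = kernel_smoother(query, keys, values, exp_dot_kernel)
print(attn)
print(smooth)
```

The two printed vectors should agree up to floating-point error, since softmax is just the kernel weights normalized to sum to one. Roughly speaking, Q is the point where you evaluate the smoother and K holds the locations of the data points whose values get averaged.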