by ActorNightly 175 days ago
>A single convolution step is a local operation (only pulling from nearby pixels), whereas attention is a "global" operation.

In the same way that the learned weights used to generate the K, Q, V matrices may have zeros (or small values) for referencing certain tokens, convolution kernels simply have zeros fixed by definition everywhere outside their window.
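
To make the comparison concrete, here is a rough NumPy sketch, under some assumptions: a 1D signal instead of an image, zero padding at the edges, and random (not trained) projection weights. The helper names `conv_as_matrix` and `attention_weights` are made up for illustration and do not come from any library. It writes a convolution as a mixing matrix whose off-band entries are hard zeros, and puts it next to a softmax attention matrix, where every entry is positive and any near-zero weight has to come from the projections rather than from the structure.

```python
import numpy as np

def conv_as_matrix(kernel, n):
    """Hypothetical helper: n x n matrix applying a 1D sliding-window
    filter ('same' size, zero padding) to a length-n signal."""
    half = len(kernel) // 2
    M = np.zeros((n, n))
    for i in range(n):
        for j, w in enumerate(kernel):
            col = i + j - half
            if 0 <= col < n:          # positions outside the window stay zero
                M[i, col] = w
    return M

def attention_weights(x, Wq, Wk):
    """Hypothetical helper: softmax(Q K^T / sqrt(d)) for one head, no mask."""
    q, k = x @ Wq, x @ Wk
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d = 6, 4                                   # 6 positions, 4 features (arbitrary)
kernel = np.array([0.25, 0.5, 0.25])          # symmetric 3-tap kernel (assumption)

conv_M = conv_as_matrix(kernel, n)            # banded: 16 of 36 entries are nonzero
x = rng.normal(size=(n, d))
attn = attention_weights(x, rng.normal(size=(d, d)), rng.normal(size=(d, d)))

signal = rng.normal(size=n)
conv_out = conv_M @ signal                    # same result as a sliding window
```

Both `conv_M` and `attn` are n x n matrices that mix positions. The difference is only where the zeros come from: in `conv_M` they are fixed by the kernel width, while in `attn` every position can contribute and the weights decide how much.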