by menaerus
513 days ago
Incorrect. Self-attention is a highly parallel algorithm, which makes it a great candidate for becoming a memory-bound workload once you have enough compute. Both datacenter-grade CPUs and GPUs have enough compute to carry out the self-attention computation, but only the latter have enough high-bandwidth memory to make the algorithm really perform. If that weren't the case, the theory behind FlashAttention wouldn't hold up in practice, and it does, the reason being that (main) memory is slow. Deep feed-forward networks, OTOH, are compute-bound.
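To put rough numbers on the memory-bound vs compute-bound distinction, here is a back-of-the-envelope roofline sketch. Everything in it is an assumption for illustration: the shapes, fp16 storage (2 bytes per element), a naive attention that writes the full n x n score matrix to main memory, and a hypothetical accelerator with 300 TFLOP/s peak compute and 2 TB/s memory bandwidth. It is a simplified traffic model, not a measurement of any real kernel.

```python
def attention_intensity(n, d, bytes_per_elem=2):
    """FLOPs per byte for naive single-head attention that materializes
    the n x n score matrix in main memory (assumed traffic model)."""
    # QK^T and PV cost 2*n*n*d FLOPs each; softmax FLOPs are ignored.
    flops = 4 * n * n * d
    # Q, K, V, O are touched once each (4*n*d elements); the score matrix
    # is written and read, then the probabilities are written and read (4*n*n).
    elems = 4 * n * d + 4 * n * n
    return flops / (elems * bytes_per_elem)


def ffn_intensity(tokens, d_model, d_ff, bytes_per_elem=2):
    """FLOPs per byte for one dense layer X @ W, X of shape (tokens, d_model)."""
    flops = 2 * tokens * d_model * d_ff
    # Read X and W once, write the output once.
    elems = tokens * d_model + d_model * d_ff + tokens * d_ff
    return flops / (elems * bytes_per_elem)


def bound(intensity, peak_flops, bandwidth):
    """Classify a kernel against the roofline ridge point (FLOPs per byte)."""
    ridge = peak_flops / bandwidth
    return "memory-bound" if intensity < ridge else "compute-bound"


# Hypothetical hardware numbers, chosen only for illustration.
PEAK_FLOPS = 300e12   # FLOP/s
BANDWIDTH = 2e12      # bytes/s

attn = attention_intensity(n=4096, d=64)
ffn = ffn_intensity(tokens=4096, d_model=4096, d_ff=16384)

print(f"attention: {attn:.1f} FLOP/byte, {bound(attn, PEAK_FLOPS, BANDWIDTH)}")
print(f"ffn:       {ffn:.1f} FLOP/byte, {bound(ffn, PEAK_FLOPS, BANDWIDTH)}")
```

Under these assumptions the naive attention lands well below the ridge point and the feed-forward layer well above it. In this model the attention intensity tends toward d/2 as n grows, so a longer sequence alone does not move it out of memory-bound territory. The feed-forward result depends on the token count being large; with only a handful of tokens the weight reads dominate and the picture changes.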