|
|
|
|
|
by zackangelo
361 days ago
|
|
In your forward pass section you give a lot of emphasis to FlashAttention, but it might be worth mentioning Paged Attention as well (which was the paper written by the vLLM authors and I believe was the genesis of the project). PA-style block tables are now supported in most fused attention kernels, but vLLM originally came up with it and it's the main reason why vLLM has such high throughput! |
|