by wxw
9 days ago
> SSA replaces the O(n²) dense attention pass with a learned sparse formulation that scales linearly with context length.

> At 1M tokens, SubQ 1.1 Small requires 64.5x less compute than dense attention and runs 56x faster than FlashAttention-2.

Awesome stuff. Solving context at the model architecture layer rather than trying to bolt on extra memory is the right direction IMO.