by GistNoesis
1180 days ago
> And if only scaling that context length weren't quadratic...

There are transformer approximations that are not quadratic, and they have been available out of the box for more than a year. There are two schools of thought here:

- People who approximate the neighbor search, with something like "Reformer", at O(L log(L)) time and memory complexity.
- People who use a low-rank approximation of the attention product, with something like "Linformer", at O(L) complexity but with more sensitivity to transformer rank collapse.
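To make the low-rank idea concrete, here is a minimal NumPy sketch of Linformer-style attention next to ordinary full attention. It is an illustration under assumptions, not the reference implementation: the function names are made up for this example, it handles a single head with no masking, and the projection matrices `E` and `F` are random stand-ins for what would be learned parameters in a real model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(Q, K, V):
    # Standard attention: the score matrix is (L, L), hence quadratic in L.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    return softmax(scores) @ V

def linformer_attention(Q, K, V, E, F):
    # Linformer-style attention (sketch): project keys and values along the
    # sequence axis from length L down to k, so the score matrix is (L, k).
    d = Q.shape[-1]
    K_proj = E @ K                         # (k, d)
    V_proj = F @ V                         # (k, d)
    scores = Q @ K_proj.T / np.sqrt(d)     # (L, k), linear in L for fixed k
    return softmax(scores) @ V_proj        # (L, d)

rng = np.random.default_rng(0)
L, d, k = 1024, 64, 32                     # sequence length, head dim, projected length
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))

# Assumption: random projections stand in for the learned E and F.
E = rng.standard_normal((k, L)) / np.sqrt(k)
F = rng.standard_normal((k, L)) / np.sqrt(k)

out = linformer_attention(Q, K, V, E, F)
print(out.shape)
```

With `k` held fixed, the score matrix grows as L × k instead of L × L, which is where the O(L) figure comes from. Setting `E` and `F` to the L × L identity recovers full attention exactly, which is a handy sanity check.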