by GistNoesis 1180 days ago

>And if only scaling that context length weren't quadratic...

There are transformer approximations that are not quadratic (available out of the box for more than a year):

Two schools of thought here:

- People who approximate the neighbor search, with something like "Reformer", at O(L log L) time and memory complexity.

- People who use a low-rank approximation of the attention product, with something like "Linformer", at O(L) complexity but with more sensitivity to transformer rank collapse.
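
To make the second option concrete, here is a rough NumPy sketch of the low-rank idea, not the actual Linformer code. The names `linformer_style_attention`, `E`, `F` and the sizes are my own assumptions for illustration, and the projections are random here, whereas in the paper they are learned. The point is only that the score matrix becomes (L, k) instead of (L, L), so cost grows linearly in L for a fixed k.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(Q, K, V):
    # Standard attention: the score matrix is (L, L), hence quadratic in L.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    return softmax(scores) @ V

def linformer_style_attention(Q, K, V, E, F):
    # E and F are (k, L) projections (hypothetical names) that compress
    # the sequence axis of K and V down to k rows before attention.
    d = Q.shape[-1]
    K_proj = E @ K                       # (k, d)
    V_proj = F @ V                       # (k, d)
    scores = Q @ K_proj.T / np.sqrt(d)   # (L, k) instead of (L, L)
    return softmax(scores) @ V_proj      # (L, d)

# Illustrative sizes only; random projections stand in for learned ones.
rng = np.random.default_rng(0)
L, d, k = 512, 64, 32
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))
E = rng.standard_normal((k, L)) / np.sqrt(k)
F = rng.standard_normal((k, L)) / np.sqrt(k)
out = linformer_style_attention(Q, K, V, E, F)
```

With k = L and identity projections this reduces to ordinary attention, which is a handy sanity check. The quality of the approximation for small k depends on how close to low-rank the attention matrix really is.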