| HN Mirror



	by 101101001010 2218 days ago
	They are also efficient at inference time. On GPUs, the difference is noticeable only for sequences of length > 1024 (2048 for Reformer, since it adds some operations for hashing), because the massive parallelism of GPUs amortizes the quadratic cost of the "usual" self-attention mechanism. [edit] Linformer (https://arxiv.org/pdf/2006.04768.pdf) is a different project from the one linked at https://linear-transformers.com/ (Transformers are RNNs, https://arxiv.org/pdf/2006.16236.pdf).
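
	To make the quadratic-versus-linear point concrete, here is a minimal pure-Python sketch of the kernelized attention idea from the "Transformers are RNNs" paper, assuming the elu(x) + 1 feature map and the non-causal case. The function names (`phi`, `quadratic_attention`, `linear_attention`, `random_matrix`) are made up for illustration and are not the linear-transformers library API. Both functions compute the same kernelized attention (not softmax attention); they only differ in the order of the sums.

```python
import math
import random

def phi(x):
    # Feature map elu(x) + 1 (assumed here); always strictly positive.
    return x + 1.0 if x > 0 else math.exp(x)

def quadratic_attention(Q, K, V):
    """O(N^2) ordering: form every pairwise similarity phi(q_i) . phi(k_j)."""
    out = []
    for q in Q:
        fq = [phi(x) for x in q]
        weights = [sum(a * phi(b) for a, b in zip(fq, k)) for k in K]
        z = sum(weights)  # normalizer for this query
        out.append([sum(w * v[d] for w, v in zip(weights, V)) / z
                    for d in range(len(V[0]))])
    return out

def linear_attention(Q, K, V):
    """O(N) ordering: summarize keys and values once, reuse for every query."""
    dk, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(dk)]  # sum_j phi(k_j) v_j^T
    z = [0.0] * dk                       # sum_j phi(k_j)
    for k, v in zip(K, V):
        fk = [phi(x) for x in k]
        for i in range(dk):
            z[i] += fk[i]
            for d in range(dv):
                S[i][d] += fk[i] * v[d]
    out = []
    for q in Q:
        fq = [phi(x) for x in q]
        norm = sum(a * b for a, b in zip(fq, z))
        out.append([sum(fq[i] * S[i][d] for i in range(dk)) / norm
                    for d in range(dv)])
    return out

def random_matrix(rows, cols, rng):
    # Hypothetical helper that builds toy inputs.
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

if __name__ == "__main__":
    rng = random.Random(0)
    Q, K, V = (random_matrix(8, 4, rng) for _ in range(3))
    a, b = quadratic_attention(Q, K, V), linear_attention(Q, K, V)
    # Largest elementwise gap between the two orderings (floating-point noise).
    print(max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)))
```

	The linear ordering does work proportional to N * dk * dv instead of N^2, which is why the gap only shows up on a GPU once N is large enough that the quadratic term is no longer hidden by parallelism.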

1 comment

Thanks also for the edit and the heads-up! I missed that.