Hacker News
by dpflan 1269 days ago
Is there a "law of tokens" growth for LLMs, à la Moore's Law, but for LLM capabilities based on token capacity?
1 comment

Self-attention cost is quadratic in sequence length. At 512 tokens the attention matrix has about 262K entries (512^2), but at 4000 tokens it has 16M and goes OOM on a single GPU. We need about 100K-1M tokens to load whole books at once.
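To make the arithmetic concrete, here is a rough sketch of how the count grows. The function names are made up for illustration, and the 2 bytes per score is an assumption (fp16); it counts a single N x N matrix only, so real usage multiplies by heads, layers and batch size.

```python
def attention_score_count(seq_len: int) -> int:
    """Number of pairwise scores in one N x N attention matrix."""
    return seq_len * seq_len


def attention_matrix_bytes(seq_len: int, bytes_per_score: int = 2) -> int:
    """Memory for one attention matrix; 2 bytes per score assumes fp16."""
    return attention_score_count(seq_len) * bytes_per_score


if __name__ == "__main__":
    # 512 and 4000 are the figures from the comment; 100K and 1M are the
    # "whole book" range it mentions.
    for n in (512, 4000, 100_000, 1_000_000):
        scores = attention_score_count(n)
        mib = attention_matrix_bytes(n) / 2**20
        print(f"{n:>9} tokens: {scores:.3e} scores, {mib:,.1f} MiB per matrix")
```

Under these assumptions a single matrix at 100K tokens is already 10^10 scores, or about 20 GB, before counting heads or layers.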

Since 2017 there have been hundreds of attempts to bring attention from O(N^2) down to O(N), but none of them has replaced vanilla attention in large models yet, because they lose on accuracy. Maybe FlashAttention has a shot, since it computes exact attention rather than an approximation (https://arxiv.org/abs/2205.14135).
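For anyone wondering how attention can stay exact without holding the full score matrix, below is a toy sketch of the online-softmax idea that FlashAttention builds on. It is not the paper's kernel: it handles one query in pure Python, skips the usual 1/sqrt(d) scaling, and the function names and `block_size` are made up for illustration. Note that compute is still quadratic overall; only the number of scores held in memory at once shrinks.

```python
import math


def attention_full(q, keys, values):
    """Reference: softmax(q . k_i) weighted sum, materializing every score."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def attention_streaming(q, keys, values, block_size=2):
    """Same result, visiting keys/values in blocks of block_size scores."""
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old max to the new max.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions should agree up to floating-point rounding, which is the point: the blockwise version is a reordering of the same computation, not an approximation.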