|
|
|
|
|
by benreesman
897 days ago
|
|
This is indeed what I was referring to and along with RoPE and related techniques is a sort of "meta-attention" in which a cost-effective scalar pointwise calculation can hint the heavyweight attention mechanism with super-linear returns in practical use cases. In more intuitive terms, your bog-standard transformer overdoes it in terms of considering all context equally in the final prediction, and we historically used rather blunt-force instruments like causally masking everything to zero. These techniques are still heuristic and I imagine every serious shop has tweaks and tricks that go with their particular training setup, but the Rope shit in general is kind of a happy medium and exploits locality at a much cheaper place in the overall computation. |
|
There are quite a few recent attention extension techniques recently published:
* Activation Beacons - up to 100X context length extension in as little as 72 A800 hours https://huggingface.co/papers/2401.03462
* Self-Extend - a no-training RoPE modification that can give "free" context extension with 100% passkey retrieval (works w/ SWA as well) https://huggingface.co/papers/2401.01325
* DistAttention/DistKV-LLM - KV cache segmentation for 2-19X context length at runtime https://huggingface.co/papers/2401.02669
* YaRN - aforementioned efficient RoPE extension https://huggingface.co/papers/2309.00071
You could imagine combining a few of these together to basically "solve" the context issue while largely training for shorter context length.
There are of course some exciting new alternative architectures, notably Mamba https://huggingface.co/papers/2312.00752 and Megabyte https://huggingface.co/papers/2305.07185 that can efficiently process up to 1M tokens...