by benreesman 897 days ago
This is indeed what I was referring to. Along with RoPE and related techniques, it is a sort of "meta-attention": a cost-effective scalar pointwise calculation can hint the heavyweight attention mechanism, with super-linear returns in practical use cases.

In more intuitive terms, your bog-standard transformer overdoes it in terms of considering all context equally in the final prediction, and we historically used rather blunt-force instruments like causally masking everything to zero.

These techniques are still heuristic, and I imagine every serious shop has tweaks and tricks that go with their particular training setup, but the RoPE shit in general is kind of a happy medium and exploits locality at a much cheaper place in the overall computation.
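
For anyone who wants the "cheap pointwise calculation" point made concrete, here is a minimal pure-Python sketch of the RoPE rotation. The function names, the toy vectors, and the base of 10000 are assumptions for illustration, not any particular library's implementation. It shows that the query/key score after rotation depends only on the relative offset between the two positions.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `pos`.

    Illustrative helper (hypothetical name); `base` is an assumed default.
    """
    d = len(vec)
    assert d % 2 == 0, "RoPE rotates pairs, so the dimension must be even"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # one frequency per pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query/key vectors, made up for the example.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same relative offset (3) at very different absolute positions.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))
print(abs(score_a - score_b) < 1e-9)
```

The rotation itself is a handful of multiplies per pair, which is the sense in which it is cheap relative to the attention matmul it feeds.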

2 comments

My understanding is that Mistral uses a regular 4K RoPE and "extends" the window size with SWA. This is based on looking at the results of Nous Research's Yarn-Mistral extension (https://huggingface.co/NousResearch/Yarn-Mistral-7b-128k) and Self-Extend, both of which only apply to RoPE models.
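
A rough sketch of how SWA "extends" reach, in case it helps. The helper names are hypothetical, and the convention that a token attends to itself plus the previous `window - 1` tokens is an assumption (implementations differ by one here). The 4096 window and 32 layers are illustrative values matching the commonly reported Mistral 7B configuration.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j.

    Assumed convention: causal, and each token sees itself plus the
    previous `window - 1` tokens.
    """
    return [[0 <= i - j < window for j in range(seq_len)]
            for i in range(seq_len)]

def receptive_field(window, n_layers):
    """How many tokens back (inclusive) information can reach after
    stacking `n_layers` sliding-window layers, under the convention above."""
    return n_layers * (window - 1) + 1

mask = sliding_window_mask(6, 3)
print(mask[5])  # last token only sees positions 3, 4, 5

# Illustrative values (assumed): 4096-token window, 32 layers.
print(receptive_field(4096, 32))
```

Each layer only looks back one window, but layer k can read hidden states that already summarize the window before them, so the reach grows roughly linearly with depth.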

Quite a few attention extension techniques have been published recently:

* Activation Beacons - up to 100X context length extension in as little as 72 A800 hours https://huggingface.co/papers/2401.03462

* Self-Extend - a no-training RoPE modification that can give "free" context extension with 100% passkey retrieval (works w/ SWA as well) https://huggingface.co/papers/2401.01325

* DistAttention/DistKV-LLM - KV cache segmentation for 2-19X context length at runtime https://huggingface.co/papers/2401.02669

* YaRN - aforementioned efficient RoPE extension https://huggingface.co/papers/2309.00071
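
To give a feel for why something like Self-Extend needs no training, here is a simplified sketch of its position remapping as I understand it from the paper: nearby tokens keep exact relative positions, distant ones get floor-divided ("grouped") positions so nothing exceeds what the model saw in training. The function names and example numbers are assumptions, and this only computes positions, not the attention itself.

```python
def self_extend_rel_pos(q_pos, k_pos, group_size, neighbor_window):
    """Relative position fed to RoPE for a (query, key) pair, k_pos <= q_pos.

    Simplified sketch (hypothetical helper), not the reference implementation.
    """
    distance = q_pos - k_pos
    if distance < neighbor_window:
        return distance  # neighbors keep their exact relative position
    # Shift so grouped positions continue where the neighbor window ends.
    shift = neighbor_window - neighbor_window // group_size
    return (q_pos // group_size + shift) - (k_pos // group_size)

def max_extended_length(train_len, group_size, neighbor_window):
    """Longest sequence whose remapped positions stay within `train_len`
    (assumes `group_size` divides `neighbor_window`)."""
    return (train_len - neighbor_window) * group_size + neighbor_window

# Assumed example: 4096-token training context, groups of 8, 1024 neighbors.
n = max_extended_length(4096, 8, 1024)
print(n)
print(self_extend_rel_pos(n - 1, 0, 8, 1024))  # stays below 4096
```

The trade-off is that distant tokens lose fine-grained ordering inside a group, which seems to matter much less than keeping positions in range.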

You could imagine combining a few of these together to basically "solve" the context issue while largely training for shorter context length.

There are of course some exciting new alternative architectures, notably Mamba https://huggingface.co/papers/2312.00752 and Megabyte https://huggingface.co/papers/2305.07185 that can efficiently process up to 1M tokens...

imo mistral-medium is worse than mixtral. Do you have API access?