| HN Mirror



	by FartyMcFarter 94 days ago
	Isn't transformer attention quadratic in context size? To reach a 1M-token context, I think these models must be taking a lot of shortcuts. I'm not an expert, but maybe this explains context rot.
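
To put a number on the "quadratic" part of the question, here is a rough sketch that counts how many query-key scores dense attention computes per head, per layer. The function name `attention_score_count` is made up for illustration, and the count ignores everything else in the model (MLP layers, projections, batching), so treat it as back-of-the-envelope arithmetic rather than a cost model.

```python
def attention_score_count(context_len: int, causal: bool = True) -> int:
    """Query-key score entries for dense attention (one head, one layer).

    Hypothetical helper for illustration only.
    """
    if causal:
        # Each token attends to itself and every earlier token: 1 + 2 + ... + n.
        return context_len * (context_len + 1) // 2
    # Without a causal mask, every token attends to every token.
    return context_len * context_len


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000, 1_000_000):
        print(f"{n:>9} tokens -> {attention_score_count(n):,} scores")
```

Growing the context 10x grows the score count roughly 100x, which is the scaling the question is about.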


vlovich123 94 days ago

Nope, there are no tricks, unless there have been major architectural shifts I missed. The rot doesn’t come from inference tricks that try to bring down the quadratic cost of attention over the KV cache. Task performance problems are generally a training problem: the longer the sequences, the fewer examples of that length you have to train on. So how do you train the model to behave well at long context? That’s where the tricks are. I believe most of it relies on synthetically generated data, if I’m not mistaken, which explains the rot.


FartyMcFarter 94 days ago

A quick Google search turns up techniques such as "sparse attention" that are used to avoid quadratic runtime.

I don't know if Anthropic has revealed such details since AI research is getting more and more secretive, but the architectural tricks definitely exist.
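
For a concrete picture of what "sparse attention" can mean, here is a minimal sketch of one common variant, a causal sliding window, where each token attends only to itself and the previous `window - 1` tokens. This is an assumption chosen for illustration: the thread does not say which sparse pattern, if any, a particular vendor uses, and the helper names `sliding_window_mask` and `sliding_window_pairs` are hypothetical.

```python
def sliding_window_mask(context_len: int, window: int) -> list[list[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    Causal sliding window: k must be in the range q - window + 1 .. q.
    """
    return [
        [0 <= q - k < window for k in range(context_len)]
        for q in range(context_len)
    ]


def sliding_window_pairs(context_len: int, window: int) -> int:
    """Count allowed (query, key) pairs without building the mask."""
    if window >= context_len:
        # The window covers everything, so this is plain causal attention.
        return context_len * (context_len + 1) // 2
    # The first `window` queries see 1, 2, ..., window keys; the rest see `window`.
    return window * (window + 1) // 2 + (context_len - window) * window


if __name__ == "__main__":
    n, w = 1_000_000, 4_096
    dense = n * (n + 1) // 2
    sparse = sliding_window_pairs(n, w)
    print(f"dense causal pairs:   {dense:,}")
    print(f"sliding-window pairs: {sparse:,}")
```

With a fixed window, the number of scored pairs grows linearly with context length instead of quadratically. The trade-off is that tokens outside the window are no longer directly visible to a query.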


vlovich123 93 days ago

Then you need to dig a little deeper. No one just applies sparse attention at inference time to a model that wasn't trained with it. It has to be used during training, because otherwise task performance degrades too much.
