| HN Mirror


by littlestymaar 602 days ago

> We can take to gain orders of magnitude more performance just like the leap that the Transformers paper had.

Afaik the most important benefit of transformers isn't their “performance” (in the sense of ability to perform their tasks) but their scalability, which comes from their ability to be trained and evaluated efficiently on big GPU clusters, which isn't something you can do with recurrent neural networks.

And then, if I understood correctly, the benefit of state-space models is that you can train them in parallel and run them in a recurrent fashion, making inference cheaper than with transformers, especially when context size grows.
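To make the "train in parallel, run recurrently" point concrete, here is a minimal sketch assuming the simplest possible case: a scalar, time-invariant linear state-space model. The names `ssm_recurrent` and `ssm_convolution` and the parameters `a`, `b`, `c` are made up for illustration and are not taken from any particular library. Both functions compute the same outputs, but in the convolution form every output can be computed independently of the others.

```python
def ssm_recurrent(xs, a, b, c):
    """Step-by-step form: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.
    Each step needs the previous state, so it is inherently sequential,
    but it only keeps a constant-size state (cheap at inference time)."""
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def ssm_convolution(xs, a, b, c):
    """Unrolled form: y_t = sum_k K_k * x_{t-k} with kernel K_k = c * a**k * b.
    No output depends on another output, so all of them could be
    computed in parallel (useful at training time)."""
    kernel = [c * (a ** k) * b for k in range(len(xs))]
    return [
        sum(kernel[k] * xs[t - k] for k in range(t + 1))
        for t in range(len(xs))
    ]


xs = [1.0, 2.0, 0.5, -1.0]
rec = ssm_recurrent(xs, a=0.5, b=1.0, c=2.0)
conv = ssm_convolution(xs, a=0.5, b=1.0, c=2.0)
print(rec)
print(conv)
```

This equivalence relies on the parameters being the same at every step; variants whose parameters depend on the input need a parallel scan rather than a fixed convolution kernel.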


quantadev 602 days ago

The biggest thing I had understood about the Transformers paper (“Attention Is All You Need”) is how the "attention heads" vectors are wired up in such a way as to allow words to be "understood" in the proper context. In other words, "see spot run" and "run a computer program" give dramatically different but specific contexts for the word "run".

It was also my understanding that without those attention heads, even scaling up to the parameter sizes we have today would not have produced the level of emergent intelligence that shocked the world with GPT 3.5. We needed both very large models and words put into semantic context in semantic space.
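As a rough illustration of the "run" example above, here is a toy sketch of single-head scaled dot-product self-attention. It assumes identity query/key/value projections and hypothetical 2-d embeddings invented for this example, so it shows only the mechanism, not what a trained model would produce. The same input vector for "run" comes out different depending on its neighbours.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head scaled dot-product self-attention.
    Assumption: Q, K and V are the raw input vectors (identity projections)."""
    d = len(vectors[0])
    out = []
    for q in vectors:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in vectors
        ]
        weights = softmax(scores)
        # Each output is a weighted average of all vectors in the sequence.
        out.append([sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)])
    return out


# Hypothetical embeddings, made up for illustration only.
emb = {
    "see": [1.0, 0.0], "spot": [0.9, 0.1], "run": [0.5, 0.5],
    "a": [0.1, 0.1], "computer": [0.0, 1.0], "program": [0.1, 0.9],
}

run_in_a = self_attention([emb[w] for w in ["see", "spot", "run"]])[2]
run_in_b = self_attention([emb[w] for w in ["run", "a", "computer", "program"]])[0]
print(run_in_a)
print(run_in_b)
```

In the first sentence the output for "run" is pulled toward the first dimension, in the second toward the second dimension, even though the input vector for "run" is identical in both.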


littlestymaar 601 days ago

Attention heads existed before Transformers; they were used in recurrent neural networks (RNNs) to improve their performance. The paper is called “Attention is all you need” because transformers keep the attention heads while discarding the RNN part entirely.

Getting rid of the RNN vastly improved training scalability and allowed big players to start training enormous models on even more enormous training sets in ways that weren't possible with an RNN, AFAIK.


quantadev 601 days ago

When discussing "Attention Heads" in the context of the Transformers Paper, there's no need to put the word "Self" in front of it, as in "Self-Attention". That's the context in which I used the word Attention above. Something similar to self-attention had pre-existed this paper, but not actual self-attention.

You're right that getting rid of "Recurrence" was another innovation, but removing it was probably more of a hack to make things parallelizable, than something that was architecturally justifiable from first principles (like self-attention is), because there's definite "power" in Recurrence (making it desirable), but it's just too costly to run that in LLMs because of CPU cycles.


littlestymaar 600 days ago

> removing it was probably more of a hack to make things parallelizable

But that's the entire point of it. Transformer-based LLMs are “more intelligent” just because you can make them bigger and train them on bigger datasets thanks to this parallelization.


quantadev 600 days ago

It's not just about size. Self-Attention is every bit as important as large size, because if we had the current large size but no Self-Attention, we wouldn't have the emergent intelligence. Also, "size" isn't even a new innovation. Self-Attention was a new innovation.


littlestymaar 600 days ago

This doesn't match the common understanding of the topic, which is that model size matters more than architecture. And training-set size matters even more, which is why single-digit-billion-parameter models are now stronger than hundreds-of-billions-parameter ones from several years earlier, when “Chinchilla optimal training” was in fashion.

SSMs are literally the proof that all that really matters is training scalability.

The Universal approximation theorem doesn't care about the architecture after all.
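For a sense of scale on the training-size point, here is a back-of-the-envelope sketch comparing tokens seen per parameter. The parameter and token counts are approximate, publicly reported figures that I'm plugging in as assumptions; `tokens_per_parameter` is just a helper name for this example. Chinchilla-style training lands at roughly 20 tokens per parameter, while recent small models are trained far beyond that.

```python
def tokens_per_parameter(params, tokens):
    """Ratio of training tokens to model parameters."""
    return tokens / params


# (parameters, training tokens): approximate public figures, treated as assumptions.
models = {
    "GPT-3 175B": (175_000_000_000, 300_000_000_000),
    "Chinchilla 70B": (70_000_000_000, 1_400_000_000_000),
    "Llama 3 8B": (8_000_000_000, 15_000_000_000_000),
}

ratios = {name: tokens_per_parameter(p, t) for name, (p, t) in models.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f} tokens per parameter")
```

Under these assumed figures, the small recent model sees on the order of a thousand times more tokens per parameter than the older very large one.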
