| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by atgctg 971 days ago
	A lot of transformer explanations fail to mention what makes self attention so powerful. Unlike traditional neural networks with fixed weights, self-attention layers adaptively weight connections between inputs based on context. This allows transformers to accomplish in a single layer what would take traditional networks multiple layers.

3 comments

WhitneyLand 971 days ago

In case it’s confusing for anyone to see “weight” as a verb and a noun so close together, there are indeed two different things going on:

1. There are the model weights, aka the parameters. These are what get adjusted during training to do the learning part. They always exist.

2. There are attention weights. These are part of the transformer architecture and they “weight” the context of the input. They are ephemeral. Used and discarded. Don’t always exist.

They are both typically 32-bit floats in case you’re curious but still different concepts.

link

airstrike 971 days ago

I always thought the verb was "weigh" not "weight", but apparently the latter is also in the dictionary as a verb.

Oh well... it seems like it's more confusing than I thought https://www.merriam-webster.com/wordplay/when-to-use-weigh-a...

link

bobbylarrybobby 971 days ago

“To weight” is to assign a weight (e.g., to weight variables differently in a model), whereas “to weigh” is to observe and/or record a weight (as a scale does).

link

DavidSJ 971 days ago

A few other cases of this sort of thing:

affect (n). an emotion or feeling. "She has a positive affect."

effect (n). a result or change due to some event. "The effect of her affect is to make people like her."

affect (v). to change or modify [X], have an effect upon [X]. "The weather affects my affect."

effect (v). to bring about [X] or cause [X] to happen. "Our protests are designed to effect change."

Also:

cost (v). to require a payment or loss of [X]. "That apple will cost $5." Past tense cost: "That apple cost $5."

cost (v). to estimate the price of [X]. "The accounting department will cost the construction project at $5 million." Past tense costed. "The accounting department costed the construction project at $5 million."

link

owlbite 971 days ago

I think in most deployments, they're not fp32 by the time you're doing inference no them, they've been quantized, possibly down to 4 bits or even fewer.

On the training side I wouldn't be surprised if they were bf16 rather than fp32.

link

kirill5pol 971 days ago

I think a good way of explaining #2 is “weight” in the sense of a weighted average

link

kmeisthax 971 days ago

None of this seems obvious just reading the original Attention is all you need paper. Is there a more in-depth explanation of how this adaptive weighting works?

link

albertzeyer 971 days ago

The audience of this paper are other researchers who already know the concept of attention, which was very well known already in the field. In such research papers, such things are never explained again, as all the researchers already know this or can read other sources, which are cited, but focus on the actual research questions. In this case, the research question was simply: Can we get away by just using attention and not using the LSTM anymore? Before that, everyone was using both together.

I think learning it following it more this historical development can be helpful. E.g. in this case here, learn the concept of attention, specifically cross attention first. And that is this paper: Bahdanau, Cho, Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate", 2014, https://arxiv.org/abs/1409.0473

That paper introduces it. But even that is maybe quite dense, and to really grasp it, it helps to reimplement those things.

It's always dense, because those papers already have space constraints given by the conferences, max 9 pages or so. To get a better detailed overview, you can study the authors code, or other resources. There is a lot now about those topics, whole books, etc.

link

BOOSTERHIDROGEN 971 days ago

What books cover exclusively about this topic ? Thanks

link

albertzeyer 971 days ago

This is frequently a topic here on HN. E.g.:

https://udlbook.github.io/udlbook/ (https://news.ycombinator.com/item?id=38424939)

https://fleuret.org/francois/lbdl.html (https://news.ycombinator.com/item?id=35767789)

https://www.fast.ai/ (https://news.ycombinator.com/item?id=24237207)

https://d2l.ai/ (https://news.ycombinator.com/item?id=38428225)

Some more:

https://news.ycombinator.com/item?id=35543774

There is a lot more. Just google for "deep learning", and you'll find a lot of content. And most of that will cover attention, as it is a really basic concept now.

link

lubesGordi 970 days ago

Thanks for the udl book (Understanding Deep Learning), that looks like a really great starting point.

link

stevenbedrick 970 days ago

To add to the excellent resources that have already been posted, Chapter 9 of Jurafsky and Martin's "Speech and Language Processing" has a nice overview of attention, and the next chapter talks specifically about the Transformer architecture: https://web.stanford.edu/~jurafsky/slp3/

link

CardenB 971 days ago

I doubt any.

link

WhitneyLand 971 days ago

It’s definitely not obvious no matter how smart you are! The common metaphor used is it’s like a conversation.

Imagine you read one comment in some forum, posted in a long conversation thread. It wouldn’t be obvious what’s going on unless you read more of the thread right?

A single paper is like a single comment, in a thread that goes on for years and years.

For example, why don’t papers explain what tokens/vectors/embedding layers are? Well, they did already, except that comment in the thread came 2013 with the word2vec paper!

You might think wth? To keep up with this some one would have to spend a huge part of their time just reading papers. So yeah that’s kind of what researchers do.

The alternative is to try to find where people have distilled down the important information or summarized it. That’s where books/blogs/youtube etc come in.

link

andai 971 days ago

Is there a way of finding interesting "chains" of such papers, short of scanning the references / "cited by" page?

(For example, Google Scholar lists 98797 citations for Attention is all you need!)

link

WhitneyLand 971 days ago

As a prerequisite to the attention paper? One to check out is:

A Survey on Contextual Embeddings https://arxiv.org/abs/2003.07278

Embeddings are sort of what all this stuff is built on so it should help demystify the newer papers (it’s actually newer than the attention paper but a better overview than starting with the older word2vec paper).

Then after the attention paper an important one is:

Language Models are Few-Shot Learners https://arxiv.org/abs/2005.14165

I’m intentionally trying to not give a big list because they’re so time-consuming. I’m sure you’ll quickly branch out based on your interests.

link

kaimac 971 days ago

I found these notes very useful. They also contain a nice summary of how LLMs/transformers work. It doesn't help that people can't seem to help taking a concept that has been around for decades (kernel smoothing) and giving it a fancy new name (attention).

http://bactra.org/notebooks/nn-attention-and-transformers.ht...

link

CyberDildonics 971 days ago

It's just as bad a "convolutional neural networks" instead of "images being scaled down"

link

taneq 971 days ago

“Convolution” is a pretty well established word for taking an operation and applying it sliding-window-style across a signal. Convnets are basically just a bunch of Hough transforms with learned convolution kernels.

link

gtoubassi 971 days ago

I struggled to get an intuition for this, but on another HN thread earlier this year saw the recommendation for Sebastian Raschka's series. Starting with this video: https://www.youtube.com/watch?v=mDZil99CtSU and maybe the next three or four. It was really helpful to get a sense of the original 2014 concept of attention which is easier to understand but less powerful (https://arxiv.org/abs/1409.0473), and then how it gets powerful with the more modern notion of attention. So if you have a reasonable intuition for "regular" ANNs I think this is a great place to start.

link

andai 971 days ago

Turns out Attention is all you need isn't all you need!

(I'm sorry)

link

pizza 971 days ago

softmax(QK) gives you a probability matrix of shape [seq, seq]. Think of this like an adjacency matrix with edges with flow weights that are probabilities. Hence semantic routing of parts of X reduced with V.

where

- Q = X @ W_Q [query]

- K = X @ W_K [key]

- V = X @ V [value]

- X [input]

hence

attn_head_i = (softmax(Q@K/normalizing term) @ V)

Each head corresponds to a different concurrent routing system

The transformer just adds normalization and mlp feature learning parts around that.

link

lchengify 971 days ago

Just to add on, a good way to learn these terms is to look at the history of neural networks rather than looking at transformer architecture in a vacuum

This [1] post from 2021 goes over attention mechanisms as applied to RNN / LSTM networks. It's visual and goes into a bit more detail, and I've personally found RNN / LSTM networks easier to understand intuitively.

[1] https://medium.com/swlh/a-simple-overview-of-rnn-lstm-and-at...

link