Hacker News new | ask | show | jobs
by godelski 900 days ago
I love and hate these.

I love them because they do give another resource at explaining models such as transformers and I think this one is pretty well done (note: you really need to do something about the equation in 4.2...)

First, the critique is coming from love. Great work, so I don't want it to be taken as I'm saying anything it isn't.

Why I hate these is that they are labeled as "math behind" but I think this is not quite fitting. This is the opposite of the complaint I made about the Introduction to DL post the other day[0]. The issue isn't that there isn't math, but contextually it is being labeled as a mathematical approach but I'm not seeing anything that distinguishes it as deeper than what you'd get from Karpathy's videos or the Annotated Transformer (I like this more than illustrated). There's nothing wrong with that, but just think it might mislead people, especially as there is a serious lack of places to find a much deeper mathematical explanation behind architectures and the naming makes it harder to find for those that are looking for that, because they'll find these posts. Simply, complaint is about framing.

To be clear, the complaint is just about the subtitle, because the article is good and a useful resource for people seeking to learn attention and transformers. But let me try to clarify some of what would I personally (welcome to disagree, it is an opinion) more accurately be representative of " demystifying all the math behind them":

- I would include a much deeper discussion of both embedding and positional embedding. The former you should at minimum be discussion how it is created and discussing the dequantization. This post may give a reader the impression that this is not taking place (there is ambiguity between distinction of embedding vs tokenization and embedding, this looks to just briefly mention tokenization. I specifically think a novice might take away that the dequantization is happening due to the positional encoding, and not in the embedding). The tokenization and embedding is a vastly underappreciated and incredibly important aspect of making discrete models work (not just LLMs or LMs. Principle is more general).

- Same goes for the positional embedding which I have only in a handful of cases seen discussed and taken rather matter of factly. For a mathematical explanation you do need to explain the idea behind generating unique signals for each position, explain why we need a a high frequency, and it is worth mentioning how this can be learnable (often with similar results, which is why most don't bother), and other forms like rotational. The principle is far more general than even a Fourier Series (unmentioned!). The continuous aspect also matters a lot here, and we (often) don't want discritized positional encoding. If this isn't explained it feels rather arbitrary, and in some ways it is but others it isn't.

- The attention mechanism is vastly under-explained, though I understand why. There are many approaches to tackle this, some from graphs, some from category theory, and many others. They're all valuable pieces to the puzzle. But at minimum I think there needs to be a clear identification as to what the dot product is doing, the softmax, the scale (see softmax tempering), and why we then have the value. Their key-query-value names were not chosen at random and the database analogy is quite helpful. Maybe many don't understand the relationship of dot products and angles between vectors? But this can even get complex as we would expect values to go to 0 in high dimensions (which they kinda do if you look at the attention matricies post learning which often look highly diagonal and why you can initialize them as diagonally spiked for sometimes faster training). This would be a great place to bring up how there might be some surprising aspects to the attention mechanism considering matrices represent affine transformations of data (linear) and we might not see the non-linearity here (softmax) or understand why softmax works better than other non-linears or normalizers (try it yourself!).

- There's more but I've written a wall. So I'll just say we can continue for the residuals (also see META's 3 Things Everyone Should Know About Vision Transformers, in the Deit repo), why we have pre-norm as opposed to the original post-norm (which it looks like post norm is being used!), the residuals (knot theory can help here a bit), and why we have the linear layer (similarly the unknotting discussion helps, especially quantifying why we like a 4x ratio, but isn't absolutely necessary).

Idk, are people interested in these things? I know most people aren't, and there's absolutely nothing wrong with that (you can still build strong models without this knowledge, but it is definitely helpful). I do feel that we often call these things black boxes but they aren't completely opaque. They sure aren't transparent, especially through scale, but they aren't "black" either. (Allen-Zhu & Li's Physics of LLMs is a great resource btw and I'd love if other users posted/referenced more things they liked. I purposefully didn't link btw)

So, I do like the post, and I think it has good value (and certainly there is always value in teaching to learn!), but I disagree with the HN title and post's subtitle.

[0] https://news.ycombinator.com/item?id=38834244

3 comments

Hey @godelski! Author of the blog post here.

I really appreciate you taking the time to provide all this feedback. This feedback + additional resources are extremely useful.

I agree that the subtitle is not as accurate as it could be. I'll revisit it! As for content updates, I've been doing some additional updates in the last days based on feedback (e.g. more info about tokenization and the token embeddings). Although diving in some of your suggestions is likely out of scope for this article, I in particular agree that expanding the attention mechanism content (e.g. the analogy with databases or explaining what is dot product) would increase the quality of the article. I will look into expanding this!

I also think a more rigorous, separate mathematical exploration into attention mechanisms and recent advancements would be a great tool for the ecosystem.

Once again, thank you for all the amazing feedback!

Hey, I'm glad you found it useful. I know it is hard to take critique, but I did enjoy the post. I truly do mean the critique is coming from a place of love. And I hope the comment helps others find more (I guess I'm writing a blog post now). I do feel there is often this gap between nearly no math and way too much math that causes a lot of people to come away with "you don't need math for ML" which is... idk... partially correct but not? haha. I'm a bit mathy of a person so you just caught a pet peeve of mine. I definitely agree what I said is out of scope for how you wrote but I will stand with my subtitle critique ;) I still do like the article though

And I just realized we're in a slack channel together haha (I don't think we've ever talked though). I poked around your website and saw you're at HF. Love you guys to death. You all also have tons of awesome blog posts and you're one of the most useful forces in ML. So I really do appreciate all the work.

Can you link a resource that is able to adequately explain why they're called Key, Query, and Value? Every explanation I've read eventually handwaved this. It feels like understanding why they're named that is key (heh) to understanding the concept, rather than just blindly implementing matmul.
https://stats.stackexchange.com/questions/421935/what-exactl...

It mentions it comes from the original Attention Is All You Need paper and goes on into more detail.

It seems to be named exactly as you would expect. Key/Value as in KV store, with Query being the term being retrived.

> Idk, are people interested in these things?

Yes, absolutely. Would be awesome to read deeper.

Seconded.
Looks like I'll be writing a blog lol.