by jaidhyani 1202 days ago

Skimming it, there are a few things about this explanation that rub me just slightly the wrong way.

1. Calling the input token sequence a "command". It probably only makes sense to think of this as a "command" on a model that's been fine-tuned to treat it as such.

2. Skipping over BPE as part of tokenization - but almost every transformer explainer does this, I guess.

3. Describing transformers as using a "word embedding". I'm actually not aware of any transformers that use actual word embeddings, except the ones that incidentally fall out of other tokenization approaches sometimes.

4. Describing positional embeddings as multiplicative. They are generally (and very counterintuitively to me, but nevertheless) additive with token embeddings.

5. "what attention does is it moves the words in a sentence (or piece of text) closer in the word embedding" No, that's just incorrect.

6. You don't actually need a softmax layer at the end, since here they're just picking the top token and they can just do that pre-softmax since it won't change. It's also weird how they talked about this here when the most prominent use of softmax in transformers is actually in the attention component.

7. Really shortchanges the feedforward component. It may be simple, but it's really important to making the whole thing work.

8. Nothing about the residual

10 comments

eiz 1201 days ago

> 4. Describing positional embeddings as multiplicative. They are generally (and very counterintuitively to me, but nevertheless) additive with token embeddings.

Worth noting that rotary position embeddings, used in many recent architectures (LLaMA, GPT-NeoX, ...), are very similar to the original sin/cos position embedding in the transformer paper but using complex multiplication instead of addition
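A minimal sketch of that idea, assuming the common formulation where adjacent feature pairs are treated as complex numbers and the base-10000 frequency schedule is borrowed from the original sin/cos embedding. The function name `rotary` is hypothetical and this is not any particular library's implementation. What it tries to show is that, after rotation, the query-key score depends only on the relative offset between the two positions.

```python
import cmath

def rotary(vec, pos, base=10000.0):
    """Rotate consecutive (even, odd) feature pairs of `vec` by position-dependent angles.

    Each pair (x, y) is treated as the complex number x + iy and multiplied
    by exp(i * pos * theta_j), with theta_j = base ** (-2j / d).
    """
    d = len(vec)
    assert d % 2 == 0, "feature dimension must be even"
    out = []
    for j in range(d // 2):
        theta = base ** (-2.0 * j / d)
        z = complex(vec[2 * j], vec[2 * j + 1]) * cmath.exp(1j * pos * theta)
        out.extend([z.real, z.imag])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (made-up values).
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# Both pairs of positions are 3 apart, so the scores should agree
# up to floating-point error.
score_a = dot(rotary(q, 5), rotary(k, 2))
score_b = dot(rotary(q, 105), rotary(k, 102))
```

Here `score_a` and `score_b` match because only the difference in positions survives the dot product, which is the sense in which the position is applied multiplicatively rather than added to the token vector.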

jaidhyani 1200 days ago

TIL. Man, I'm behind on my paper reading.

sillysaurusx 1202 days ago

One way to think about the positional embedding: in the same way you can hear two pieces of music overlaid on each other, you can add the vocab and pos embeddings together and the model is still able to pick them apart.

If you asked yourself to identify when someone’s playing a high note or low note (pos embedding) and whether they’re playing Beethoven or Lady Gaga (vocab embedding) you could do it.

That’s why it’s additive and why it wouldn’t make much sense for it to be multiplicative.
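For concreteness, here is a rough sketch of the additive version, assuming sin/cos position vectors in the style of the original transformer paper and a made-up random embedding table (a real model learns these values). The names `sinusoidal_position`, `token_embedding`, and `embed` are hypothetical.

```python
import math
import random

def sinusoidal_position(pos, d_model, base=10000.0):
    """Sin/cos positional vector in the style of the original transformer paper."""
    vec = []
    for i in range(0, d_model, 2):
        angle = pos / (base ** (i / d_model))
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec[:d_model]

random.seed(0)
d_model = 8
vocab = ["the", "cat", "sat"]
# Hypothetical toy embedding table; a trained model would supply these.
token_embedding = {w: [random.gauss(0.0, 1.0) for _ in range(d_model)] for w in vocab}

def embed(tokens):
    """Input to the first layer: token vector plus position vector, elementwise."""
    return [
        [t + p for t, p in zip(token_embedding[tok], sinusoidal_position(pos, d_model))]
        for pos, tok in enumerate(tokens)
    ]

# "the" appears at positions 0 and 2 and gets a different vector each time.
x = embed(["the", "cat", "the"])
```

The same token at two positions ends up with two different input vectors, and the later layers have to separate the "what" from the "where" out of that single sum.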

isaacfung 1201 days ago

The visualisation here may be helpful.

https://github.com/tensorflow/tensor2tensor/issues/1591

jaidhyani 1200 days ago

Thanks, that's a really useful intuition!

VMG 1202 days ago

I have to agree. The article summary says

> Transformer block: Guesses the next word. It is formed by an attention block and a feedforward block.

But the diagram shows transformer blocks chained in sequence. So the next transformer block in the sequence would only receive a single word as the input? Does not make sense.

bomewish 1201 days ago

You seem to know a bunch about this. What’s your rec for best single explainer?

isaacfung 1201 days ago

Not the guy you asked, but these are often recommended.

https://jalammar.github.io/illustrated-transformer/

https://nlp.seas.harvard.edu/2018/04/03/attention.html

edge17 1201 days ago

Before going and digging into these, could you also explain what the necessary background is for this stuff to be meaningful?

In spite of having done a decent amount with neural networks, I'm a bit lost at how we suddenly got to what we're seeing now. It would be really helpful to understand the progression of things because I stepped away from this stuff for maybe 2 years and we seem to have crossed an ocean in the intervening time.

jaidhyani 1200 days ago

I am the guy asked and I endorse this guy's endorsements.

hansvm 1202 days ago

> 6

Selecting the likeliest token is only one of many sampling options, and it's extremely poor for most tasks, more so when you consider the relationships between multiple executions of the model. _Some_ (not necessarily softmax) probability renormalization trained into the model is essential for a lot of techniques.

toxik 1201 days ago

To expand on this, one of the most common tricks is Nucleus sampling. Roughly, you zero out the lowest probabilities such that the remaining sum to just above some threshold you decide (often around 80%).

The idea is that this is more general than eg changing the temperature of the softmax, or using top-k where you just keep the k most probable outcomes.

Note that if you do Nucleus sampling (aka top-p) with the threshold p=0% you just pick the maximum likelihood estimate.
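A small sketch of the procedure as described, using only the standard library. The helper names (`softmax`, `nucleus_filter`, `sample`) and the example logits are made up for illustration, and real implementations usually work on batched tensors rather than Python lists.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; subtracting the max keeps exp() stable."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_filter(probs, p):
    """Keep the smallest set of top tokens whose total probability reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / total
    return filtered

def sample(probs, rng=random):
    """Draw one token index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0, -3.0]   # made-up scores for a 5-token vocab
probs = softmax(logits)
top_p = nucleus_filter(probs, 0.8)     # zeroes out the low-probability tail
greedy = nucleus_filter(probs, 0.0)    # only the single most likely token survives
next_token = sample(top_p)
```

With `p=0.0` the loop stops after the first (most probable) token, which is the greedy case mentioned above.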

jaidhyani 1200 days ago

That's true, but they didn't go into any other applications in this explainer and were presenting it strictly as a next-word-predictor. If they are going to include final softmax, they should explain why it's useful. It would be improved by being simpler (skip softmax) or more comprehensive (present a use case for softmax), but complexity without reason is bad pedagogy.
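To make the "skip softmax" point concrete, here is a tiny check with made-up logits: softmax is monotonic, so the index of the largest logit is also the index of the largest probability.

```python
import math

def softmax(logits):
    """Standard softmax; subtracting the max keeps exp() stable."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

logits = [1.3, -0.2, 4.1, 0.7]   # made-up scores over a 4-token vocab
probs = softmax(logits)

# Picking the top token gives the same answer before and after softmax.
same_choice = argmax(logits) == argmax(probs)
```

The normalization only starts to matter once you sample from the distribution or compare probabilities across steps.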

antimora 1202 days ago

I am trying to learn more in depth. Could you suggest some good resource for learning transformers?

metanonsense 1201 days ago

When I first tried to understand transformers, I superficially understood most material, but I always felt that I did not really get it on a "I am able to build it and I understand why I am doing it" level. I struggled to get my fingers on what exactly I did not understand. I read the original paper, blog posts, and watched more videos than I care to admit.

The one source of information that made it click for me was chapters 159 to 163 of Sebastian Raschka's phenomenal "Intro to deep learning and generative models" course on YouTube. https://www.youtube.com/playlist?list=PLTKMiZHVd_2KJtIXOW0zF...

TyrianPurple 1201 days ago

Sebastian Raschka's course is really good. Gone through it like three times.

indeedmug 1202 days ago

I found these resources to be helpful.

https://jalammar.github.io/illustrated-transformer/ This is a good illustration of the transformer and how the math works.

https://karpathy.ai/zero-to-hero.html If you want a deeper understanding of transformers and how they fit into the whole picture of deep learning, this series is far and away the best resource I found. Karpathy gets to transformers by the sixth lecture; the previous lectures give a lot more context on how deep learning works.

pankajdoharey 1201 days ago

I agree that Karpathy's YouTube video is an excellent resource for understanding Transformers from scratch. It provides a hands-on experience that can be particularly helpful for those who want to implement the models themselves. Here's the link to the video titled "Let's build GPT: from scratch, in code, spelled out": https://youtu.be/kCc8FmEb1nY

Additionally, for more comprehensive resources on Transformers, you may find these resources useful:

* The Illustrated Transformer by Jay Alammar: http://jalammar.github.io/illustrated-transformer/

* MIT 6.S191: Recurrent Neural Networks, Transformers, and Attention: https://www.youtube.com/watch?v=ySEx_Bqxvvo

* Karpathy's course, Deep Learning and Generative Models (Lecture 6 covers Transformers): https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThs......

These resources cover different aspects of Transformers and can help you grasp the underlying concepts and mechanisms better.

jaidhyani 1200 days ago

I endorse all of this and will further endorse (probably as a follow-up once one has a basic grasp) "A Mathematical Framework for Transformer Circuits" which builds a lot of really useful ideas for understanding how and why transformers work and how to start getting a grasp on treating them as something other than magical black boxes.

https://transformer-circuits.pub/2021/framework/index.html

Buttons840 1202 days ago

I've been reading this paper with pseudocode for various transformers and finding it helpful: https://arxiv.org/abs/2207.09238

"This document aims to be a self-contained, mathematically precise overview of transformer architectures and algorithms (not results). It covers what transformers are, how they are trained, what they are used for, their key architectural components, and a preview of the most prominent models."

quantisan 1202 days ago

this one's been mentioned a lot: Let's build GPT: from scratch, in code, spelled out. https://youtu.be/kCc8FmEb1nY

andai 1202 days ago

The whole playlist is fantastic: https://youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9Gv...

detrites 1201 days ago

This hour-long MIT lecture is very good; it builds from the ground up to transformers. MIT 6.S191: Recurrent Neural Networks, Transformers, and Attention: https://youtube.com/watch?v=ySEx_Bqxvvo

mdp2021 1201 days ago

The uploads of the 2023 MIT 6.S191 course from Alexander Amini (et alii) are ongoing, released periodically since mid-March. (They published the lesson about Reinforcement Learning yesterday.)

andai 1202 days ago

Here's the original paper: https://arxiv.org/abs/1706.03762

jaidhyani 1200 days ago

The original paper is very good but I would argue it's not well optimized for pedagogy. Among other things, it's targeting a very specific application (translation) and in doing so adopts a more complicated architecture than most cutting-edge models actually use (encoder-decoder instead of just one or the other). The writers of the paper probably didn't realize they were writing a foundational document at the time. It's good for understanding how certain conventions developed and important historically - but as someone who did read it as an intro to transformers, in retrospect I would have gone with other resources (e.g. "The Illustrated Transformer").

chaxor 1202 days ago

I know we don't have access to the details at OpenAI - but it does seem like there have been significant changes to the BPE token size over time. It seems there is a push towards much larger tokens than the previous ~3 char tokens (at least by behavior)

microtonal 1202 days ago

BPE is not set to a certain length, but a target vocabulary size. It starts with bytes (or characters) as the basic unit in which everything is split up and merges units iteratively (choosing the most frequent pairing) until the vocab size is reached. Even 'old' BPE models contain plenty of full tokens. E.g. RoBERTa:

https://huggingface.co/roberta-base/raw/main/merges.txt

(You have to scroll down a bit to get to the larger merges and imagine the lines without the spaces, which is what a string would look like after a merge.)

Also see GPT-2:

https://huggingface.co/gpt2/raw/main/merges.txt

I recently did some statistics. Average number of pieces per token (sampled on fairly large data, these are all models that use BBPE):

RoBERTa base (English): 1.08

RobBERT (Dutch): 1.21

roberta-base-ca-v2 (Catalan): 1.12

ukr-models/xlm-roberta-base-uk (Ukrainian): 1.68

In all these cases, the median token length in pieces was 1.

(Note: I am not disputing that newer OpenAI models use a larger vocab. I just want to show that older BBPE models didn't use 3 char pieces. They were 1 piece per token for most tokens.)

montebicyclelo 1201 days ago

OpenAI have made their tokenizers public [1].

As someone has pointed out, with BPE you specify the vocab size, not the token size. It's a relatively simple algo, and this Hugging Face course does a nice job of explaining it [2]. Plus the original paper has a very readable Python example [3].

[1] https://github.com/openai/tiktoken

[2] https://huggingface.co/course/chapter6/5?fw=pt

[3] https://arxiv.org/abs/1508.07909
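To illustrate the "vocab size, not token size" point, here is a rough character-level sketch of BPE training. It is not the code from [3] and not how tiktoken is implemented; the function name `train_bpe`, the toy word list, and the target size of 15 are all made up, and byte-level details and end-of-word markers are left out.

```python
from collections import Counter

def train_bpe(words, vocab_size):
    """Learn merges until the symbol vocabulary reaches vocab_size (or no pairs remain)."""
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    vocab = {ch for word in corpus for ch in word}
    merges = []
    while len(vocab) < vocab_size:
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        vocab.add(best[0] + best[1])
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for word, freq in corpus.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, vocab

# Toy corpus: 10 distinct characters, so a target of 15 allows 5 merges.
words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, vocab = train_bpe(words, vocab_size=15)
```

Nothing in the loop caps how long a merged symbol can get; frequent words like "low" end up as a single token simply because their pairs keep winning the frequency count.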

charcircuit 1202 days ago

>and very counterintuitively to me

It's more intuitive if you remember how many dimensions these vectors have.
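One possible reading of this (my interpretation, not something the comment spells out) is that random directions in a high-dimensional space are almost orthogonal, so two signals added together interfere very little. A quick stdlib-only experiment, with a hypothetical helper `mean_abs_cosine` and 768 chosen only as a typical-looking model width:

```python
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mean_abs_cosine(dim, trials=200, seed=0):
    """Average |cosine similarity| between pairs of random Gaussian vectors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        b = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        total += abs(cosine(a, b))
    return total / trials

low = mean_abs_cosine(2)      # random 2-D vectors overlap a lot on average
high = mean_abs_cosine(768)   # random 768-D vectors are close to orthogonal
```

In two dimensions the average overlap is large, while at a few hundred dimensions it drops close to zero, which leaves plenty of room for a token vector and a position vector to coexist in one sum.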

Hendrikto 1201 days ago

> Skipping over BPE as part of tokenization

Well, there are other methods in use. See ByT5, for example.

oergiR 1201 days ago

I agree except for (6). A language model assigns probabilities to sequences. The model needs normalised distributions, eg using a softmax, so that’s the right way of thinking about it.

jaidhyani 1200 days ago

This is true in general but not in the use case they presented. If they had explained why a normalized distribution is useful it would have made sense - but they just describe this as pick-the-top-answer next-word predictor, which makes the softmax superfluous.