Hacker News
by mike_hearn 1046 days ago
For AI learners like me, here's an attempt to briefly explain some of the terms and concepts in this blog post, in the rough order they appear.

A token is a unique integer identifier for a piece of text. The simplest tokenization scheme is just Unicode, where one character gets one integer. However, LLMs have a limited number of token IDs available for use (the vocabulary), so a more common approach is to glue characters together into common fragments. This post just uses the subset of ASCII needed by TinyShakespeare.
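As a rough sketch of the character-level scheme described above (the sample `text` and the `stoi`/`itos`/`encode`/`decode` names are my own stand-ins, not taken from the post):

```python
# Character-level tokenizer sketch. `text` stands in for the real corpus.
text = "to be or not to be"

# The vocabulary is every distinct character, sorted so the IDs are stable.
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}  # character -> token ID
itos = {i: ch for ch, i in stoi.items()}      # token ID -> character

def encode(s):
    """Turn a string into a list of token IDs."""
    return [stoi[ch] for ch in s]

def decode(ids):
    """Turn a list of token IDs back into a string."""
    return "".join(itos[i] for i in ids)

tokens = encode("to be")
print(tokens)          # [6, 4, 0, 1, 2]
print(decode(tokens))  # to be
```

With only seven distinct characters in the sample, the vocabulary has seven entries; a fragment-based tokenizer would have far more.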

The "loss function" is just a measure of how far the model's prediction is from the ground truth. Lower loss = better predictions. Different tasks have different loss functions, e.g. edit distance might be one (but not a good one). During training you compute the loss and will generally visualize it on a chart. Whilst the line is heading downwards your NN is getting better, so you can keep training.
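To make "lower loss = better predictions" concrete, here is a minimal sketch using cross-entropy, a loss commonly used for next-token prediction. I'm assuming that choice for illustration, and the probabilities are made up:

```python
import math

def cross_entropy(probs, target):
    """Negative log of the probability the model gave to the correct token."""
    return -math.log(probs[target])

target = 1                            # index of the correct next token
confident_right = [0.05, 0.90, 0.05]  # model puts 90% on the right answer
unsure = [0.40, 0.30, 0.30]           # model puts only 30% on it

print(cross_entropy(confident_right, target))  # small loss
print(cross_entropy(unsure, target))           # larger loss
```

The better prediction gets the smaller number, which is all the training chart is tracking.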

PyTorch is a library for working with neural networks and tensors. A tensor is either a single number (0 dimensions, a scalar), an array of numbers (1 dimension, a vector), or a multi-dimensional array of numbers where the 2-dimensional case is called a matrix. But a tensor can have any number of dimensions. PyTorch has a relatively large amount of magic going on in it via reflection and other things, so don't expect the code to make much intuitive sense at first. As your code runs, it records a computation graph behind the scenes, which is what later lets it compute gradients, and the work can run on the GPU (or CPU). The tutorial is easy to read, though!
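A small sketch of the "any number of dimensions" idea, using plain nested lists rather than PyTorch so it runs anywhere. The `shape` helper is my own and assumes rectangular, non-empty lists; PyTorch tensors expose a similar `.shape` attribute.

```python
def shape(t):
    """Return the dimensions of a nested-list 'tensor' (rectangular, non-empty)."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]
    return tuple(dims)

scalar = 3.0                          # 0 dimensions
vector = [1.0, 2.0, 3.0]              # 1 dimension
matrix = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]            # 2 dimensions
cube = [[[0.0, 0.0], [0.0, 0.0]]]     # 3 dimensions

for t in (scalar, vector, matrix, cube):
    print(shape(t))
```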

A neural network is a set of neurons, each of which has a number called the bias, and connections between them, each of which has an associated weight. Numbers (activations) flow from input neurons through the connections, being multiplied by the weights along the way, to arrive at an output neuron. There they are summed and the bias is added before the result is emitted to the next layer. The weights and biases are the network parameters and encode its knowledge.

A linear layer is a set of input neurons connected to a set of output neurons, where every input is connected to every output. It's one of the simplest kinds of neural network structure. If you ever saw a diagram of a neural network pre-2010 it probably looked like that. The size of the input and output layers can be different.
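A minimal sketch of a linear layer in plain Python, with three inputs fully connected to two outputs. Each output is the weighted sum of all inputs plus that output's bias. The numbers are arbitrary and the `linear` helper is my own name, not a library call:

```python
def linear(x, weights, biases):
    """Fully connected layer: out[j] = sum_i(weights[j][i] * x[i]) + biases[j]."""
    return [
        sum(w * x_i for w, x_i in zip(row, x)) + b
        for row, b in zip(weights, biases)
    ]

x = [1.0, 2.0, 3.0]            # 3 input activations
weights = [[0.5, 0.0, -1.0],   # one row of weights per output neuron
           [1.0, 1.0, 1.0]]
biases = [0.5, -1.0]           # one bias per output neuron

print(linear(x, weights, biases))  # [-2.0, 5.0]
```

Note that the input size (3) and output size (2) differ, which is fine.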

ReLU is an activation function. It's just Math.max(0, x) i.e. it sets all negative numbers to zero. These are placed on the outputs of a neuron and are one of those weird mathematical hacks where I can't really explain why it's needed, but introducing "kinks" in the function helps the network learn. Exactly what "kinks" work best is an open area of exploration and later the author will replace ReLU with a newer more complicated function.
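In Python the same thing is a one-liner. This is just a sketch applied to a few sample numbers:

```python
def relu(x):
    """Pass positive values through unchanged, clamp negatives to zero."""
    return max(0.0, x)

samples = [-2.0, -0.5, 0.0, 1.5]
print([relu(v) for v in samples])  # [0.0, 0.0, 0.0, 1.5]
```

The "kink" is at zero: the slope is 0 to the left of it and 1 to the right.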

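Gradients are a kind of numeric diff computed during training: for each parameter, the gradient says how much the loss would change if that parameter were nudged slightly. They are used to update the model and make it more accurate.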
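Here is a toy sketch of how a gradient gets used, assuming a one-weight "model" `y = w * x`, a squared-error loss, and a hand-derived gradient. A real framework computes the gradient for you.

```python
# Fit y = w * x to a single data point with plain gradient descent.
x, y_true = 2.0, 6.0    # the ideal weight is 3.0
w = 0.0                 # starting guess
learning_rate = 0.1

for step in range(20):
    y_pred = w * x
    loss = (y_pred - y_true) ** 2
    grad = 2 * (y_pred - y_true) * x   # d(loss)/dw, worked out by hand
    w -= learning_rate * grad          # step against the gradient

print(w)  # very close to 3.0
```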

Batch normalization is a way to process the numbers as they flow through the network, which helps the network learn better.
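A simplified sketch of the core step, for one feature across a batch of four examples. Real batch normalization layers also learn a scale and shift and track running statistics, which I've left out here; the `batch_norm` helper and the numbers are mine.

```python
import math

def batch_norm(values, eps=1e-5):
    """Rescale one feature across a batch to about zero mean, unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    # eps avoids dividing by zero when every value in the batch is identical.
    return [(v - mean) / math.sqrt(var + eps) for v in values]

activations = [100.0, 102.0, 98.0, 104.0]   # large, off-center numbers
normalized = batch_norm(activations)
print(normalized)  # values now centered on 0 with a spread of about 1
```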

Positional encodings help the network understand the positions of tokens relative to each other, expressed in the form of a vector.

The `@` infix operator in Python is an alias for the __matmul__ method and is used as a shorthand for matrix multiplication (there are linear algebra courses on YouTube that are quite good if you want to learn this in more detail).
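To see the `@` to `__matmul__` link in isolation, here is a tiny hypothetical `Matrix` class (not from any library) that implements the method itself:

```python
class Matrix:
    """Toy matrix wrapper, only here to show how `@` dispatches."""

    def __init__(self, rows):
        self.rows = rows

    def __matmul__(self, other):
        # Row-by-column dot products; assumes the inner dimensions match.
        cols = list(zip(*other.rows))
        return Matrix([[sum(a * b for a, b in zip(row, col)) for col in cols]
                       for row in self.rows])

a = Matrix([[1, 2], [3, 4]])
b = Matrix([[5, 6], [7, 8]])
print((a @ b).rows)  # [[19, 22], [43, 50]]
```

Writing `a @ b` is the same as calling `a.__matmul__(b)`; tensor libraries define the method so the operator works on their own types.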

An epoch is one complete pass over the dataset during training. NNs need to be shown the data many times to fully learn, so you repeat the dataset. The batch size is how many of the items in the dataset are fed to the network before updating the parameters. These sorts of numbers are called hyperparameters, because they're things you can fiddle with but the word parameters was already used for weights/biases.
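The shape of the loop, sketched with a dummy ten-item dataset and no actual model (the sizes are arbitrary):

```python
dataset = list(range(10))   # stand-in for real training examples
batch_size = 4              # hyperparameter: items per parameter update
epochs = 3                  # hyperparameter: passes over the whole dataset

updates = 0
for epoch in range(epochs):
    for start in range(0, len(dataset), batch_size):
        batch = dataset[start:start + batch_size]
        # A real loop would compute the loss on `batch` and update weights here.
        updates += 1

print(updates)  # 9: three batches (4, 4 and 2 items) per epoch, three epochs
```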

Attention is the magic that makes LLMs work. There are good explanations elsewhere, but briefly it processes all the input tokens in parallel to compute some intermediate tensors, and those are then used in a second stage to emit a series of output tokens.
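For the curious, the core arithmetic is small enough to sketch in plain Python. This is scaled dot-product attention only, under my own simplifying assumptions: no learned projections, no multiple heads, no masking, and made-up numbers. A real transformer wraps all of those around it.

```python
import math

def softmax(xs):
    """Turn scores into positive weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Each query takes a weighted average of the values, weighted by how
    well the query matches each key."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [1.0, 0.0]]     # identical keys, so equal weights
values = [[0.0, 2.0], [4.0, 6.0]]
print(attention(queries, keys, values))  # [[2.0, 4.0]]
```

Because both keys match the query equally, the output is simply the average of the two value vectors.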

5 comments

One more for the list is that a lot of people don't know what "Karpathy" means unless they are in the field and have been reading papers.

It might be good to include context like "the science communicator/researcher, Andrej Karpathy" so that it is clearer that it is referring to a useful person to look at posts from.

Another learner here, one clarification that I think is useful even for beginners:

> A token is a unique integer identifier for a piece of text.

A token is a word fragment that's common enough to be useful on its own - for example, "writing", "written", "writer" all have "writ", so "writ" would be an individual token, and "writer" might be tokenized as "writ" and "er".

An embedding is where the tokens get turned into unique numeric identifiers.

Tokens are also numbers in practice, but they're indexes into a lookup table of character sequences so yes there's very little between the two definitions. Embeddings are in turn the result of looking up that index in a table, and the result is a vector. So:

character sequence (string) -> token (small integer) -> embedding (vector of floats)
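The same pipeline as a sketch, with a hypothetical three-fragment vocabulary and made-up embedding values (a real model learns them during training):

```python
vocab = ["writ", "er", "ing"]   # hypothetical fragment vocabulary
token_id = {frag: i for i, frag in enumerate(vocab)}

# One row of floats per token ID. Values are invented for illustration.
embedding_table = [
    [0.1, -0.3],   # row for token 0, "writ"
    [0.7, 0.2],    # row for token 1, "er"
    [-0.5, 0.9],   # row for token 2, "ing"
]

fragments = ["writ", "er"]                          # "writer", already split
tokens = [token_id[f] for f in fragments]           # strings -> small integers
embeddings = [embedding_table[t] for t in tokens]   # integers -> float vectors

print(tokens)      # [0, 1]
print(embeddings)  # [[0.1, -0.3], [0.7, 0.2]]
```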

The tokens are in this case actually the individual characters:

    vocab = sorted(list(set(lines)))

> These are placed on the outputs of a neuron and are one of those weird mathematical hacks where I can't really explain why it's needed,

Because when you compose linear functions you get linear functions. So having linear everything is a waste of all layers but one.

In order for this not to happen, you need nonlinearity.
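A quick sketch that shows the collapse with two one-dimensional "layers". The coefficients are arbitrary and the function names are mine:

```python
def layer1(x):
    return 2.0 * x + 1.0

def layer2(x):
    return -3.0 * x + 4.0

def collapsed(x):
    # layer2(layer1(x)) = -3 * (2x + 1) + 4 = -6x + 1, a single linear function
    return -6.0 * x + 1.0

def relu(x):
    return max(0.0, x)

xs = [-2.0, 0.0, 1.5]
stacked = [layer2(layer1(x)) for x in xs]          # two linear layers
single = [collapsed(x) for x in xs]                # one equivalent layer
with_relu = [layer2(relu(layer1(x))) for x in xs]  # nonlinearity in between

print(stacked)    # [13.0, 1.0, -8.0]
print(single)     # [13.0, 1.0, -8.0]
print(with_relu)  # [4.0, 1.0, -8.0]
```

The two stacked linear layers match the single one exactly, while the version with ReLU in the middle does not, so it can represent something one linear layer cannot.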

Thanks!
This is fantastic, thanks!

Any pointers / references / books that you’ve found particularly helpful in your learning journey?

I know about Karpathy’s video series (and accompanying repos). Anything else come to mind? Thanks!

I've been using a pretty random mix of things including the PyTorch tutorial, some of the tutorials on how transformers work that got posted here months ago, reading papers, and (of course) asking GPT4. It probably isn't the most efficient way to learn.

I would say that learning how to actually build NNs is likely not that important. What's far more important is to know how to use LLMs as an API or library, which is of course 1% coding because the API is so easy and 99% figuring out what their limits are, how best to integrate them into workflows, how to design textual "protocols" to communicate with the AI, how to test non-deterministic systems and so on. Learning how to train a model from scratch is fun but to get competitive results is too expensive, so pragmatism requires focus on being a user for now.

Use perplexity.ai. Why not use AI to learn AI? The thing I like about this tool is that it gives citations, so you can go beyond its summaries and learn more.
Thank you! What is batch normalization doing and how does it help?
There are other mechanisms for dealing with vanishing and exploding gradients. I (maybe wrongly?) think of batch normalization as being most distinctively about fighting internal covariate shift: https://machinelearning.wtf/terms/internal-covariate-shift/
Karpathy covers this in Makemore, but the tl;dr is that if you don’t normalize the batch (essentially center and scale your activations to roughly zero mean and unit variance), then at gradient/backprop time, you may get values that are significantly smaller or greater than 1. This is a problem, because as you stack layers in sequence (passing outputs to inputs), the gradient compounds (because of the Chain Rule), and so what may have been a well-behaved gradient at the end layers has either vanished (the upstream gradients were 0<x<1 at each layer) or exploded (the gradients were x>>1 upstream). Batch normalization helps control the vanishing/exploding gradient problem in deep neural nets by normalizing the values passed between layers.
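The compounding is easy to see with a toy sketch. I'm assuming every layer scales the gradient by the same factor, which real networks don't do exactly, but the effect is the same:

```python
def gradient_after(layers, per_layer_factor):
    """Chain rule as repeated multiplication: one factor per layer."""
    grad = 1.0
    for _ in range(layers):
        grad *= per_layer_factor
    return grad

print(gradient_after(50, 0.5))   # vanishes: a tiny number near zero
print(gradient_after(50, 1.0))   # stays at 1.0
print(gradient_after(50, 1.5))   # explodes: hundreds of millions
```

Keeping the per-layer factor near 1 is what stops the product running away in either direction.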
Got it, thanks.
It's another one of those mathematical hacks that NNs love so much, which stops the numbers spiralling out of control in big networks.
Folks, thanks for the explanation.