by spi 1049 days ago

Kudos for the work! Stupid comment (not really on the main topic of the blog post, but it might be useful anyway for future "toy example" models): in the initial SimpleBrokenModel class [EDIT: and also in SimpleModel], there is actually quite a bit of wasted computation (roughly 66% of all the model's computations!). You are applying, in sequence, the following layers:

- embedding 65 -> 128

- linear 128 -> 128

- ReLU

- linear 128 -> 65

But since there's no non-linearity at all between the first two layers, and they are both linear... the second one is totally useless. This model is effectively a "classical" single-hidden-layer MLP. And in terms of FLOPs, it's wasting 128*128 = 16k operations out of a total of 128*128 + 65*128 = 24k operations.
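To make the point above concrete, here is a minimal pure-Python sketch. The weights are random stand-ins (not the blog post's trained values), the `linear` helper is a hypothetical name, and the operation count ignores biases and the lookup itself. It folds the 128 -> 128 linear layer into the embedding table once, then checks that a plain lookup in the folded table gives the same result as lookup followed by linear.

```python
import random

VOCAB, DIM = 65, 128  # sizes taken from the layer list above

random.seed(0)
# Assumption: random weights standing in for the trained ones.
embedding = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]
weight = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(DIM)]  # (out, in)
bias = [random.gauss(0, 1) for _ in range(DIM)]


def linear(x, w, b):
    """Compute w @ x + b for a single input vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


# Fold the linear layer into the embedding table once, ahead of time.
folded = [linear(row, weight, bias) for row in embedding]

# For every token, lookup-then-linear equals a lookup in the folded table.
max_diff = max(
    abs(a - b)
    for token in range(VOCAB)
    for a, b in zip(linear(embedding[token], weight, bias), folded[token])
)

# Multiply-accumulate counts per token (biases and the lookup ignored).
wasted = DIM * DIM               # the redundant 128 -> 128 linear layer
total = DIM * DIM + VOCAB * DIM  # plus the final 128 -> 65 linear layer
wasted_fraction = wasted / total

print(max_diff, wasted, total, round(wasted_fraction, 3))
```

The folded table has the same shape as the original embedding (65 x 128), so after folding the first linear layer costs nothing at inference time.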

mike_hearn 1048 days ago

Seems I'm not the only one still getting to grips with non-linearity, lol (see discussion down-thread).

So what's the best fix here? Adding a ReLU or SwiGLU between the embedding and first linear layer, or just deleting the linear? Presumably the embedding layer is required to convert token indexes to the embedding vector, so you can't get rid of that; it has a special structure.

spi 1048 days ago

Well, it depends what you mean by “best” :-) Removing the linear layer is the easiest solution. (Indeed, you can’t remove the embedding one. In theory you could replace embedding + linear by one-hot encoding + linear, adapting the input dimension of the linear layer to match your vocabulary size, but that would be identical to an embedding layer, just much slower and more memory hungry.)
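As a rough illustration of the one-hot remark, the sketch below uses a random 65 x 128 matrix as a stand-in for the weights (an assumption, as are the helper names). It compares a bias-free linear layer applied to a one-hot vector against a direct row lookup in the same matrix.

```python
import random

VOCAB, DIM = 65, 128

random.seed(0)
# Assumption: one random matrix, read either as an embedding table or as the
# (in, out) weights of a bias-free linear layer with VOCAB inputs.
table = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]


def one_hot(token, size):
    """Vector of zeros with a single 1.0 at position `token`."""
    return [1.0 if i == token else 0.0 for i in range(size)]


def one_hot_linear(token):
    """Multiply the one-hot vector by the full matrix: VOCAB * DIM multiplies."""
    x = one_hot(token, VOCAB)
    return [sum(x[i] * table[i][j] for i in range(VOCAB)) for j in range(DIM)]


def embedding_lookup(token):
    """Just read one row: no arithmetic at all."""
    return table[token]


same = all(one_hot_linear(t) == embedding_lookup(t) for t in range(VOCAB))
print(same)
```

Both paths return the same row, but the one-hot version performs 65 * 128 multiplications per token and has to materialise a 65-element input vector, which is where the extra time and memory go.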

Alternatively, you could indeed put a ReLU or other non-linearity between embedding and linear. You get a different model with more layers and more parameters. As the given dataset is pretty large, I’m quite sure this would bring an improvement to accuracy, but without testing it’s impossible to know. Normalisation also acts as some kind of non-linearity, but when the author adds it, it barely helps accuracy at all, so who knows. Sometimes (often) neural networks are counter-intuitive…

mike_hearn 1047 days ago

Why does adding a ReLU create more layers and parameters? Isn't the total number of neurons the same?

hansvm 1046 days ago

The representational capacity of two consecutive linear layers is the same as that of a single, slightly different linear layer. Once you introduce a ReLU into the mix, the capacity is (up to a complexity defined by the number of parameters) any "nice" function, including things like e^sin(x), not just linear functions. With two consecutive linear layers, many of the weights and computations are redundant.
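A small sketch of that claim, with hand-picked 2 x 2 weights (purely illustrative, as are all the function names): two stacked linear layers are reproduced exactly by one linear layer with W = W2 W1 and b = W2 b1 + b2, while the same stack with a ReLU in between fails a simple necessary test for affine functions, f(x) + f(-x) = 2 f(0).

```python
def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]


def add(u, v):
    return [p + q for p, q in zip(u, v)]


def relu(v):
    return [max(0.0, t) for t in v]


# Assumption: small illustrative weights, not taken from any real model.
w1, b1 = [[1.0, -2.0], [0.5, 3.0]], [0.5, -1.0]
w2, b2 = [[2.0, 1.0], [-1.0, 4.0]], [0.0, 1.0]


def two_linear(x):
    return add(matvec(w2, add(matvec(w1, x), b1)), b2)


def with_relu(x):
    return add(matvec(w2, relu(add(matvec(w1, x), b1))), b2)


# Collapse the two linear layers into a single one.
w, b = matmul(w2, w1), add(matvec(w2, b1), b2)


def one_linear(x):
    return add(matvec(w, x), b)


def looks_affine(f, x):
    """Necessary condition for an affine map: f(x) + f(-x) == 2 * f(0)."""
    neg = [-t for t in x]
    zero = [0.0] * len(x)
    return add(f(x), f(neg)) == [2 * t for t in f(zero)]


x = [1.0, 2.0]
print(two_linear(x) == one_linear(x))  # the stack collapses to one layer
print(looks_affine(two_linear, x), looks_affine(with_relu, x))
```

The weights are multiples of 0.5 so that the floating-point comparisons are exact; with arbitrary weights one would compare with a tolerance instead.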

mike_hearn 1046 days ago

Right, I get that: it increases learning capacity, but doesn't introduce more parameters? Like the GPU requirements would be the same beyond the extra cost of the ReLU operation itself, yes?

spi 1044 days ago

Yes of course, sorry my write-up was confusing: I meant that "adding a ReLU between the two linear layers" (the second option) would result in more parameters than "directly removing the second linear layer" (the first option). And my message just meant "I don't know which of the two options achieves the best trade-off between speed and quality". I didn't consider the option "leave it as it is in the blog post" because it is essentially equivalent to the first option (removing the linear layer) but slower (as you say, with exactly the same number of parameters as the second option), so it definitely shouldn't be a "best" option.
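For what it's worth, the parameter counts of the three options can be tallied as below. This is a back-of-the-envelope sketch assuming each linear layer carries a bias vector (the blog post's layers may differ), with hypothetical helper names and option labels.

```python
VOCAB, DIM = 65, 128


def embedding_params(vocab, dim):
    return vocab * dim


def linear_params(n_in, n_out, bias=True):
    # Assumption: linear layers include a bias term unless stated otherwise.
    return n_in * n_out + (n_out if bias else 0)


emb = embedding_params(VOCAB, DIM)  # embedding 65 -> 128
hidden = linear_params(DIM, DIM)    # linear 128 -> 128
head = linear_params(DIM, VOCAB)    # linear 128 -> 65

options = {
    # embedding, linear, ReLU, linear (as in the blog post)
    "as_is": emb + hidden + head,
    # first option: drop the redundant linear layer
    "remove_linear": emb + head,
    # second option: ReLU between embedding and linear; a ReLU has no weights
    "add_relu": emb + hidden + head,
}

for name, count in options.items():
    print(f"{name}: {count} parameters")
```

Under these assumptions "as_is" and "add_relu" have the same count, and "remove_linear" has about half of it, which matches the trade-off described above.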
