| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by mike_hearn 1046 days ago
	Seems I'm not the only one still getting to grips with non-linearity, lol (see discussion down-thread). So what's the best fix here? Adding a ReLU or SwiGLU between the embedding and first linear layer, or just deleting the linear? As presumably the embedding layer is required to convert token indexes to the embedding vector and you can't get rid of that, it has a special structure.

1 comments

spi 1045 days ago

Well it depends what you mean by “best” :-) removing the linear layer is the easiest solution (indeed you can’t remove the embedding one; in theory you could replace embedding + linear by one hot encoding + linear, adapting the input dimension or the linear layer to match your vocabulary size, but that would just be identical to embedding layer, just much slower and more memory hungry).

Alternatively, you could indeed put a ReLU or other non linearity between embedding and linear, you get a different model with more layers and more parameters, as the given dataset is pretty large I’m quite sure this would bring an improvement to accuracy, but without testing it’s rather impossible to know. Normalisation also acts as some kind of non linearity, but when the author adds it that barely helps accuracy at all, so who knows, sometimes (often) neural networks are counter intuitive…

link

mike_hearn 1044 days ago

Why does adding a ReLU create more layers and parameters? Isn't the total number of neurons the same?

link

hansvm 1044 days ago

The representational capacity of two consecutive linear layers is the same as one slightly different linear layer. The capacity when you introduce a relu into the mix is (up to a complexity defined by the number of parameters) any "nice" function -- including things like e^sin(x) -- not just linear functions. With two consecutive linear layers many of the weights and computations are redundant.

link

mike_hearn 1044 days ago

Right, I get that: it increases learning capacity, but doesn't introduce more parameters? Like the GPU requirements would be the same beyond the extra cost of the ReLU operation itself, yes?

link

spi 1041 days ago

Yes of course, sorry my write-up was confusing: I meant that "adding a ReLU between the two linear layers" (the second option) would result in more parameters than "directly removing the second linear layer" (the first option). And my message just meant "I don't know which of the two options achieves the best trade-off between speed and quality". I didn't consider the option "leave it as it is in the blog post" because it is essentially equivalent to the first option (removing the linear layer) but slower (as you say, with exactly the same number of parameters as the second option), so it definitely shouldn't be a "best" option.

link

mike_hearn 1039 days ago

Thank you!

link