by spi
1049 days ago
Kudos for the work! Stupid comment (not really on the main topic of the blog post, but it might be useful anyway for future "toy example" models): in the initial SimpleBrokenModel class [EDIT: and also in SimpleModel], there is actually quite a bit of wasted computation (something like > 66% of all the model's computation!). You are applying, in sequence, the following layers:
- embedding 65 -> 128
- linear 128 -> 128
- ReLU
- linear 128 -> 65
But since there's no non-linearity at all between the first two layers, and both are linear, the second one is totally useless: its weights can be folded into the embedding table. This model is effectively a "classical" single-hidden-layer MLP. And in terms of FLOPs, it's wasting 128*128 = 16k operations out of a total of 128*128 + 65*128 = 24k operations.
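A minimal sketch of the point above, in plain Python rather than the blog post's PyTorch classes. The weights are random and the names (`E`, `W1`, `W2`, `forward_original`, `forward_folded`) are made up for illustration; only the layer sizes come from the comment. It folds the 128 -> 128 linear layer into the embedding table once, then checks that both versions give the same logits for every token:

```python
import random

VOCAB, D_MODEL = 65, 128  # layer sizes quoted in the comment


def linear(W, b, x):
    """Return W @ x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def relu(x):
    return [max(0.0, v) for v in x]


def rand_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Random stand-in weights (an assumption; not the trained model from the post).
rng = random.Random(0)
E = rand_matrix(rng, VOCAB, D_MODEL)      # embedding table, one row per token
W1 = rand_matrix(rng, D_MODEL, D_MODEL)   # linear 128 -> 128
b1 = [rng.gauss(0.0, 1.0) for _ in range(D_MODEL)]
W2 = rand_matrix(rng, VOCAB, D_MODEL)     # linear 128 -> 65
b2 = [rng.gauss(0.0, 1.0) for _ in range(VOCAB)]


def forward_original(tok):
    # embedding -> linear -> ReLU -> linear
    return linear(W2, b2, relu(linear(W1, b1, E[tok])))


# Apply the first linear layer to every row of the table once, ahead of time.
E_folded = [linear(W1, b1, row) for row in E]


def forward_folded(tok):
    # folded embedding -> ReLU -> linear
    return linear(W2, b2, relu(E_folded[tok]))


# Largest difference between the two models' logits over the whole vocabulary.
max_diff = max(
    abs(a - c)
    for tok in range(VOCAB)
    for a, c in zip(forward_original(tok), forward_folded(tok))
)

# Multiply-accumulate counts per token, using the comment's accounting
# (the embedding lookup itself is treated as free).
macs_first = D_MODEL * D_MODEL   # the redundant 128 -> 128 layer
macs_second = VOCAB * D_MODEL    # the 128 -> 65 output layer
wasted = macs_first / (macs_first + macs_second)

print(f"max logit difference: {max_diff:.2e}")
print(f"wasted fraction: {wasted:.1%}")
```

The folded table has the same 65 x 128 shape as the original embedding, so the two-layer version has no extra capacity at inference time. Whether the extra layer changes training dynamics is a separate question this sketch does not address.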
|
So what's the best fix here? Adding a ReLU or SwiGLU between the embedding and the first linear layer, or just deleting the linear layer? Presumably the embedding layer is required to convert token indexes into embedding vectors, so you can't get rid of that one; it has a special structure.