by spi
1037 days ago
Well, it depends on what you mean by “best” :-) Removing the linear layer is the easiest solution. (Indeed, you can’t remove the embedding layer. In theory you could replace embedding + linear with one-hot encoding + linear, adapting the input dimension of the linear layer to match your vocabulary size, but that would be mathematically identical to an embedding layer, just much slower and more memory hungry.)

Alternatively, you could indeed put a ReLU or another nonlinearity between the embedding and the linear layer. You get a different model with more layers and more parameters. Since the given dataset is pretty large, I’m fairly sure this would improve accuracy, but without testing it’s impossible to know. Normalisation also acts as a kind of nonlinearity, but when the author adds it, it barely helps accuracy at all, so who knows. Sometimes (often) neural networks are counterintuitive…
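To make the embedding vs. one-hot point concrete, here is a minimal sketch in plain Python. The sizes, the names `E` and `W`, and the bias-free linear layer are my own assumptions for illustration, not the article's actual model. It checks that embedding lookup + linear, one-hot + matrix products, and a single merged lookup table all give the same output for a token:

```python
import random

random.seed(0)
vocab_size, embed_dim, out_dim = 5, 3, 2  # hypothetical toy sizes


def rand_matrix(rows, cols):
    """Random matrix stored as a list of rows."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def close(u, v, tol=1e-9):
    """Element-wise comparison of two vectors up to a tolerance."""
    return len(u) == len(v) and all(abs(x - y) <= tol for x, y in zip(u, v))


E = rand_matrix(vocab_size, embed_dim)  # embedding table (one row per token)
W = rand_matrix(embed_dim, out_dim)     # linear layer weights, bias omitted (assumption)

token = 3

# Path 1: embedding lookup, then the linear layer.
embed_then_linear = matmul([E[token]], W)[0]

# Path 2: one-hot vector times the embedding table, then the linear layer.
# Same result, but it touches every row of E instead of just one.
one_hot = [1.0 if i == token else 0.0 for i in range(vocab_size)]
one_hot_then_linear = matmul(matmul([one_hot], E), W)[0]

# Path 3: fold E and W into one table up front; the two layers
# collapse into a single lookup because nothing nonlinear sits between them.
merged = matmul(E, W)
merged_lookup = merged[token]

print(close(embed_then_linear, one_hot_then_linear))
print(close(embed_then_linear, merged_lookup))
```

The one-hot path does `vocab_size` multiplications per embedding dimension just to select a single row, which is where the "slower and more memory hungry" part comes from once the vocabulary is large.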