|
|
|
|
|
by Majromax
212 days ago
|
|
The basic MLP block in this model uses a ReLU^2 activation function (x <- ReLU(x)^2). That seems to be copied from the nanochat project, and it's not present in nanoGPT. Is there some documentation on the choice of this activation function? |
|