Hacker News
by chessgecko 719 days ago
*edit: never mind the below; this is a character-level model that probably has a small vocab, so it wouldn’t make a massive difference

Is this taking into account the parameters in the embedding and the output FFN? Normally, when a model is really small and the vocab is large, those two can account for an extremely large share of the parameters, which would explain why the optimal number of layers here is unusually small.

I suspect it isn’t being taken into account, because doubling the embedding and cutting the number of layers in half only holds the parameter count constant if you ignore the embedding and output, but I’d need to see more details on the config (mainly the vocab size he used) to confirm.
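
For a rough sense of how much the vocab size matters, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration and not the article's actual config: the `param_counts` helper is hypothetical, `d_model=128` and `n_layers=4` are made-up sizes, the block count uses the common approximation of 12 * d_model^2 parameters per layer (attention plus a 4x-wide MLP, ignoring biases, layer norms and positional embeddings), and 50257 is just the GPT-2 vocab size used as a "large vocab" comparison point.

```python
def param_counts(vocab_size, d_model, n_layers, tied=False):
    """Approximate parameter counts for a decoder-only transformer.

    Ignores biases, layer norms and positional embeddings.
    """
    embedding = vocab_size * d_model
    # The output projection shares weights with the embedding when tied.
    output = 0 if tied else vocab_size * d_model
    # Attention (Q, K, V, O) is 4 * d^2; an MLP with 4x hidden width is 8 * d^2.
    per_layer = 12 * d_model ** 2
    blocks = n_layers * per_layer
    vocab_params = embedding + output
    total = vocab_params + blocks
    return {
        "vocab_params": vocab_params,
        "blocks": blocks,
        "total": total,
        "vocab_share": vocab_params / total,
    }


# Hypothetical small model: d_model=128, n_layers=4 (not the article's config).
for vocab in (56, 50257):
    counts = param_counts(vocab, d_model=128, n_layers=4)
    print(f"vocab={vocab}: vocab share of parameters = {counts['vocab_share']:.1%}")
```

Under these assumptions, a character-level vocab contributes only a couple of percent of the parameters, while a GPT-2-sized vocab would dominate a model this small.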

1 comment

vocab size is like 56, not doing much