by ajtejankar 914 days ago
The base model has 32 layers, plus a single linear layer for language modeling (mapping the final embeddings to scores over the vocabulary) that gets applied once at the very end.
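To make the shape of that concrete, here is a rough pure-Python sketch. Only the 32 layers and the single final linear layer come from the comment above. The toy sizes (`HIDDEN`, `VOCAB`), the residual-plus-linear stand-in for a real transformer block, and all function names are assumptions made up for illustration, not the actual model's code.

```python
import random

NUM_LAYERS = 32  # from the comment above
HIDDEN = 8       # toy hidden size (assumption, real models are far larger)
VOCAB = 16       # toy vocabulary size (assumption)


def make_matrix(rows, cols, rng):
    """Random rows x cols matrix as nested lists."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(x, weight):
    """Apply a bias-free linear map; weight has shape out_dim x in_dim."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def toy_layer(x, weight):
    """Stand-in for a transformer block: a residual connection around a linear map."""
    return [xi + yi for xi, yi in zip(x, linear(x, weight))]


def forward(token_id, embed, layers, lm_head):
    """Embed one token, run it through every layer, then apply the LM head once."""
    h = embed[token_id]
    for w in layers:
        h = toy_layer(h, w)
    # The language-modeling head is only applied here, after the last layer.
    return linear(h, lm_head)


rng = random.Random(0)
embed = make_matrix(VOCAB, HIDDEN, rng)                          # token id -> embedding
layers = [make_matrix(HIDDEN, HIDDEN, rng) for _ in range(NUM_LAYERS)]
lm_head = make_matrix(VOCAB, HIDDEN, rng)                        # embedding -> vocabulary scores

logits = forward(3, embed, layers, lm_head)
print(len(layers), len(logits))
```

The point is just that the hidden state keeps the same width through all 32 layers, and only the last step changes it to one score per vocabulary entry.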