Hacker News new | ask | show | jobs
by TrueDuality 1146 days ago
Tokenization use one-hot encodings, so that matrix will always be n^2 the number of tokens. This has an impact on all the subsequent layers and final number of parameters. You want to use as information dense tokens as possible, while being able to represent weird or unseen tokens, but discrete enough to allow differentiation of concepts.