by akiselev 1146 days ago
Why is the supply of tokens limited? Are they currently represented as 16-bit unsigned ints (I hear the vocab size is about 50k for GPT-3)? If so, is there a performance penalty for going to u32 beyond the extra memory?
1 comment

Tokenization is conceptually a one-hot encoding, so the embedding matrix always has one row per token: for a vocabulary of n tokens and an embedding width of d it holds n × d weights, and the output layer that scores the next token needs a projection of the same size. That cost shows up directly in the final number of parameters. You want tokens that are as information-dense as possible while still being able to represent weird or unseen strings, yet discrete enough to differentiate concepts.
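To put rough numbers on that, here is a minimal sketch in plain Python. The function names and the width of 768 are my own assumptions for illustration, not taken from any particular model. It shows that multiplying a one-hot vector by the embedding table is just a row lookup, that the table grows linearly with vocabulary size, and that a 50k vocabulary still fits in 16-bit IDs.

```python
def one_hot(index: int, size: int) -> list[float]:
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def lookup_via_one_hot(token_id: int, table: list[list[float]]) -> list[float]:
    """Multiply a one-hot row vector by the embedding table.

    The result equals table[token_id], which is why real implementations
    index the table directly instead of materializing the one-hot vector.
    """
    vec = one_hot(token_id, len(table))
    width = len(table[0])
    return [
        sum(vec[row] * table[row][col] for row in range(len(table)))
        for col in range(width)
    ]


def embedding_params(vocab_size: int, d_model: int, tied: bool = True) -> int:
    """Weights in the input embedding plus the output projection.

    `tied=True` assumes both share one vocab_size x d_model matrix;
    otherwise the projection is counted as a second matrix of that size.
    """
    table = vocab_size * d_model
    return table if tied else 2 * table


def fits_in_u16(vocab_size: int) -> bool:
    """True if every token ID in range(vocab_size) fits in 16 unsigned bits."""
    return vocab_size <= 2**16


D_MODEL = 768  # assumed embedding width, purely illustrative

for vocab in (50_000, 65_536, 1_000_000):
    print(vocab, fits_in_u16(vocab), embedding_params(vocab, D_MODEL))
```

Under these assumptions, going past 65,536 tokens only widens each stored ID by two bytes, while every extra vocabulary entry adds a full row of d weights to the table (twice that if the output projection is not shared).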