Why is the supply of tokens limited? Are they currently represented as 16 bit unsigned ints (I hear vocab size of about 50k for GPT3)? If so, is there a performance penalty for going to u32 beyond the extra memory?
Tokenization use one-hot encodings, so that matrix will always be n^2 the number of tokens. This has an impact on all the subsequent layers and final number of parameters. You want to use as information dense tokens as possible, while being able to represent weird or unseen tokens, but discrete enough to allow differentiation of concepts.