by do_not_redeem 522 days ago
So here is ChatGPT's token list: https://gist.github.com/s-macke/ae83f6afb89794350f8d9a1ad8a0...

Is there some reason it isn't alphabetical (more specifically, lexically sorted by codepoint)? If you had a model with sorted tokens, the tokens sharing a given prefix would form one contiguous range of IDs, so you'd be able to solve this by constraining output to tokens with the desired prefix, probably with some mechanism similar to how this works: https://github.com/ggerganov/llama.cpp/blob/master/grammars/...
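To make the idea concrete, here is a minimal sketch of what prefix-constrained selection could look like if the vocabulary were sorted by codepoint. Everything in it is hypothetical: the toy `vocab`, the `prefix_range` and `constrained_argmax` helpers, and the plain list of logits are illustrations only, not llama.cpp's grammar mechanism or any real tokenizer's API.

```python
import bisect
import sys

def prefix_range(sorted_vocab, prefix):
    """Return (lo, hi) so that sorted_vocab[lo:hi] are exactly the tokens
    starting with `prefix`. Assumes sorted_vocab is sorted by codepoint,
    which is how Python compares str values."""
    lo = bisect.bisect_left(sorted_vocab, prefix)
    # Build the smallest string that sorts after every string starting with
    # `prefix`: drop trailing max-codepoint characters, then bump the last one.
    stem = prefix.rstrip(chr(sys.maxunicode))
    if not stem:
        return lo, len(sorted_vocab)
    upper = stem[:-1] + chr(ord(stem[-1]) + 1)
    hi = bisect.bisect_left(sorted_vocab, upper)
    return lo, hi

def constrained_argmax(logits, lo, hi):
    """Pick the highest-scoring token ID inside the allowed range [lo, hi)."""
    if lo >= hi:
        raise ValueError("no token has the requested prefix")
    return max(range(lo, hi), key=logits.__getitem__)

# Toy vocabulary (hypothetical); the token ID is the index in the sorted list.
vocab = sorted(["a", "ab", "abc", "b", "ba", "he", "hello", "help", "world"])
lo, hi = prefix_range(vocab, "he")
logits = [9.0, 0.0, 0.0, 0.0, 0.0, 1.0, 3.0, 2.0, 8.0]
best = constrained_argmax(logits, lo, hi)
print(vocab[lo:hi], vocab[best])
```

With a sorted vocabulary the allowed set is a single slice found by two binary searches, so the mask is just a pair of integers. Note that this sketch only admits tokens that extend the prefix; tokens that are themselves a shorter prefix of the desired text (for example "h" when the target is "he") would need separate handling.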

2 comments

They are sorted, just not alphabetically: roughly speaking, rank is proportional to inverse frequency (like Zipf’s law, but with merged subwords ranked as well). This is actually extremely important. The model's prediction target is a very-high-cardinality categorical variable (the vocabulary index of the argmax token), and ordering by frequency makes that index slightly smoother and slightly more predictable for the model.
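For anyone wondering where a frequency-like ordering would come from, here is a toy sketch, assuming the list was produced by a byte-pair-encoding style tokenizer in which each new token gets the next free ID at the moment its pair is merged. The `train_bpe` function and the tiny corpus are hypothetical and heavily simplified; this is not the actual training code behind ChatGPT's tokenizer.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Toy BPE trainer: returns the vocabulary in ID order.
    Base characters come first; each merge appends one new token."""
    corpus = Counter(tuple(w) for w in words)
    vocab = sorted({ch for word in corpus for ch in word})
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself
        # so the result is deterministic.
        (a, b), _count = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        merged = a + b
        vocab.append(merged)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return vocab

words = ["the"] * 5 + ["then"] * 2 + ["cat"]
vocab = train_bpe(words, num_merges=3)
print(vocab)
```

In this toy run the frequent pieces ("th", "the") are merged before the rarer "then", so they end up with lower IDs. That is the sense in which the ordering tracks frequency rather than spelling.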
516 instances of "\r\n"!