| HN Mirror



	by do_not_redeem 522 days ago
	So here is ChatGPT's token list: https://gist.github.com/s-macke/ae83f6afb89794350f8d9a1ad8a0... Is there some reason it isn't alphabetical (more specifically, lexically sorted by codepoint)? If you had a model with sorted tokens, you'd be able to solve this by constraining output to tokens with the desired prefix, probably with some mechanism similar to how this works: https://github.com/ggerganov/llama.cpp/blob/master/grammars/...
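A rough sketch of the idea, assuming a hypothetical vocabulary that really is sorted by codepoint: every token starting with a given prefix then sits in one contiguous index range, so two binary searches find it and everything outside it can be masked out. The names `prefix_range` and `mask_logits` and the toy vocabulary are made up for illustration; this is not how llama.cpp grammars or any real tokenizer are implemented.

```python
from bisect import bisect_left

def prefix_range(sorted_vocab, prefix):
    """Return (lo, hi) so that sorted_vocab[lo:hi] are the tokens starting with prefix.

    Assumes sorted_vocab is sorted by codepoint, which is how Python compares str.
    """
    # First token that is >= prefix; matching tokens, if any, start here.
    lo = bisect_left(sorted_vocab, prefix)
    # From lo onward, matching tokens come first, so binary-search for the
    # first token that no longer starts with the prefix.
    left, right = lo, len(sorted_vocab)
    while left < right:
        mid = (left + right) // 2
        if sorted_vocab[mid].startswith(prefix):
            left = mid + 1
        else:
            right = mid
    return lo, left

def mask_logits(logits, lo, hi):
    """Keep logits for indices in [lo, hi); set every other index to -inf."""
    return [x if lo <= i < hi else float("-inf") for i, x in enumerate(logits)]

# Toy vocabulary (hypothetical), sorted by codepoint.
vocab = sorted(["a", "ab", "abc", "b", "ba", "he", "hello", "help", "her", "hi"])

lo, hi = prefix_range(vocab, "he")
print(vocab[lo:hi])  # ['he', 'hello', 'help', 'her']

logits = [0.0] * len(vocab)
masked = mask_logits(logits, lo, hi)
```

With an unsorted vocabulary the same constraint still works, but the allowed set has to be found by scanning every token (or with a precomputed trie) rather than by a range lookup.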

2 comments

pizza 522 days ago

They are sorted, just not alphabetically: roughly speaking, a token's rank is proportional to its inverse frequency (like Zipf's law, but with merged subwords also being ranked). This is actually extremely important, because the token index is otherwise a very-high-cardinality categorical target that the model predicts by argmax over the vocabulary, and this ordering makes it slightly smoother and slightly more predictable for the model.
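To illustrate where a frequency-like ordering can come from, here is a toy byte-pair-encoding trainer, on the assumption that the token list comes from a BPE-style tokenizer whose ids follow merge order. The function `train_toy_bpe` and the tiny corpus are hypothetical; a real tokenizer works on bytes and a far larger corpus.

```python
from collections import Counter

def train_toy_bpe(words, num_merges):
    """Return a vocabulary list whose index order is: base symbols, then merges in the order learned."""
    # Represent each word as a tuple of symbols, with its count in the corpus.
    corpus = Counter(tuple(w) for w in words)
    # Base symbols get the lowest ids.
    vocab = sorted({ch for word in corpus for ch in word})
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Merge the most frequent pair; ties are broken by the pair itself.
        (a, b), _count = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        merged = a + b
        vocab.append(merged)  # the new token takes the next id
        # Rewrite every word with the chosen pair replaced by the merged symbol.
        new_corpus = Counter()
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return vocab

words = ["the"] * 5 + ["then"] * 2 + ["hen"]
toy_vocab = train_toy_bpe(words, num_merges=3)
print(toy_vocab)  # ['e', 'h', 'n', 't', 'he', 'the', 'then']
```

In this toy run the most frequent pair ("he") is merged first and so gets the lowest id among the merged tokens, followed by "the" and then "then", which is the sense in which index order tracks frequency.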


versteegen 522 days ago

516 instances of "\r\n"!
