HN Mirror



	by WorldMaker 238 days ago
	The tokenizer is already a form of (somewhat lossy) compression from a string of plaintext to a stream of token identifiers. You can reason about tokenizers/"embedding spaces" as a sort of massive "dictionary table/dictionary function" like you might use in a zip/gzip stream. Starting with already-compressed data doesn't necessarily mean fewer tokens: you can probably assume similar entropy (or probably worse entropy) when expanding "dictionary words" from a compressed stream versus "tokens" from a plaintext stream.
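
	A rough way to see this for yourself is the toy sketch below, which is not any real model's tokenizer. `VOCAB` and `toy_tokenize` are made-up names for illustration: a tiny hand-picked "dictionary" with greedy longest-match and a single-character fallback (this toy happens to be lossless). It tokenizes a plaintext sentence, then tokenizes the same sentence after zlib + base64, and compares token counts.

```python
import base64
import zlib

# Hypothetical toy vocabulary: a few multi-character "dictionary words".
# Anything not covered falls back to one token per character.
VOCAB = ["the ", "token", "compress", "stream", "dictionary",
         " of ", "ing", "text", "plain"]


def toy_tokenize(text: str) -> list[str]:
    """Greedy longest-match tokenization with single-character fallback."""
    words = sorted(VOCAB, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        for w in words:
            if text.startswith(w, i):
                tokens.append(w)
                i += len(w)
                break
        else:
            # No dictionary word matches here: emit a single character.
            tokens.append(text[i])
            i += 1
    return tokens


plain = ("the tokenizer is compressing the stream of plaintext "
         "into a stream of token identifiers.")

# "Pre-compress" the text, then base64 it so it is still a text string.
packed = base64.b64encode(zlib.compress(plain.encode("utf-8"))).decode("ascii")

plain_tokens = toy_tokenize(plain)
packed_tokens = toy_tokenize(packed)

print("plain :", len(plain), "chars ->", len(plain_tokens), "tokens")
print("packed:", len(packed), "chars ->", len(packed_tokens), "tokens")
```

	The dictionary was built for plaintext, so the compressed stream gets almost no multi-character matches and degrades to roughly one token per character. A real tokenizer has a far larger vocabulary, but the same mismatch should apply: its entries were learned from natural text, not from high-entropy compressed bytes.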