Hacker News new | ask | show | jobs
by WorldMaker 238 days ago
The tokenizer is already a form of (somewhat lossy) compression of a string of plaintext to a stream of token identifiers. You can reason about Tokenizers/"embedding spaces" as a sort of massive "Dictionary Table/Dictionary Function" like you might use in a zip/gzip stream.

Starting with already compressed data doesn't necessarily mean fewer tokens, you can probably assume similar entropy (or probably worse entropy) in expanding "Dictionary words" in a compressed stream versus "tokens" from a plaintext stream.