by mcyc 535 days ago
Hi, I'm Cognetta from the above Cognetta et al. I can't answer all of your questions (and I can't speak for the authors of this paper ofc), but I will try to answer some.

> Is a tokenizer that maximizes the compression of text (e.g. by identifying longer tokens that tend to be used whole) necessarily a better tokenizer, in terms of overall model performance? Compression might be a useful property for an objective function to consider... but then again maybe not, if it makes the problem NP-hard.

Compression isn't necessarily the best metric for language modeling quality [1][2][3], but some papers do find a correlation between it and quality [4]. It also has one important benefit: it reduces inference time by making the input sequences shorter (this is particularly important for transformers, because the runtime of self-attention is quadratic in the sequence length).
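A back-of-the-envelope sketch of that last point, assuming attention cost scales with the square of the token count and ignoring everything else in the model. The function names and the token counts are made up for illustration:

```python
def tokens_per_char(num_tokens: int, num_chars: int) -> float:
    """Simple compression measure: fewer tokens per character means better compression."""
    return num_tokens / num_chars


def relative_attention_cost(num_tokens: int, baseline_tokens: int) -> float:
    """Attention cost relative to a baseline, ASSUMING cost grows as n^2."""
    return (num_tokens / baseline_tokens) ** 2


# Hypothetical numbers: the same 4,000-character document under two tokenizers.
num_chars = 4000
tokens_a = 1000  # tokenizer A
tokens_b = 800   # tokenizer B produces 20% fewer tokens

print("A tokens/char:", tokens_per_char(tokens_a, num_chars))
print("B tokens/char:", tokens_per_char(tokens_b, num_chars))
print("B attention cost relative to A:", relative_attention_cost(tokens_b, tokens_a))
```

Under that assumption, 20% fewer tokens works out to roughly a third less attention compute, which is why compression shows up directly in inference cost.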

If you imagine that with enough data, basically any reasonable tokenization algorithm would be ok (I think this is mostly true; there are definitely bad and "better" tokenizers, and you see this very clearly in small-data settings, but once you get into the trillions-of-tokens and tens-of-billions-of-parameters setting, other things are going to matter more), then optimizing the tokenizer for compression is a good choice, as it provides tangible, practical benefits in the form of reduced inference time.

> I'm also not sure how realistic the limitation to "at most δ symbols" is. [...] But why not just keep adding tokens as needed, rather than imposing any preordained limit?

This is a pretty realistic limitation imo. Of course you can arbitrarily increase the vocabulary size, but there is a tradeoff between modeling quality, parameter count, and inference time. If you increase the vocabulary a bunch, your inference speed will probably improve because sequences get shorter (although now you have a much larger softmax at the end of your model, which isn't usually a bottleneck anymore, but still not great), parameter count will increase (due to the larger embedding table), and your modeling quality will go down (in that you have tokens which are so rare in the corpus that they are massively undertrained; this can cause big problems [5]).

So by constraining it to δ, you are basically setting a parameter budget for the vocabulary, and this is a pretty reasonable thing to do.
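To make the "parameter budget" idea concrete, here is a minimal sketch. It assumes the vocabulary costs one embedding row of width d_model per token, plus an output projection of the same shape when input and output embeddings are not tied. The helper name, the d_model value, and the vocabulary sizes are illustrative, not taken from any particular model's config:

```python
def vocab_params(vocab_size: int, d_model: int, tied: bool = True) -> int:
    """Parameters spent on the vocabulary (assumption: one d_model-wide row per
    token, doubled when the output projection is not tied to the input table)."""
    table = vocab_size * d_model
    return table if tied else 2 * table


d_model = 4096  # hypothetical hidden size

for vocab_size in (52_000, 256_000):
    tied = vocab_params(vocab_size, d_model, tied=True)
    untied = vocab_params(vocab_size, d_model, tied=False)
    print(f"vocab={vocab_size:>7,}  tied={tied:>13,}  untied={untied:>13,}")
```

With these made-up numbers, going from 52k to 256k entries moves the vocabulary cost from about 0.2B to about 1B parameters (tied), so capping the vocabulary size really is capping a chunk of the model's parameter count.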

> IIRC OpenAI's tokenizer has a vocabulary of around 52k subword strings.

Yeah, the size of the vocabulary varies a lot across models, but it isn't unusual to see significantly larger vocabularies these days (e.g., gemma has ~256k). However, these are still finite and very small compared to the corpus size.

> How could you possibly choose a meaningful δ from first principles?

This is a really great question, and something that we don't know how to answer. A lot of work has tried to answer it [6][7], but it is very much an open question.

[1]: https://arxiv.org/abs/2310.08754

[2]: https://aclanthology.org/2023.acl-long.284/

[3]: https://aclanthology.org/2024.emnlp-main.40/

[4]: https://arxiv.org/abs/2403.06265

[5]: https://aclanthology.org/2024.emnlp-main.649/

[6]: https://aclanthology.org/2023.acl-long.284/

[7]: https://aclanthology.org/2020.findings-emnlp.352/


NB: Can't edit my original reply.

Sorry, I actually misread part of your comment in relation to the paper and confused δ with another parameter, K.

To clarify, δ is the number of tokens in the tokenized corpus and K is the size of the vocabulary.

So, if you are asking why they would limit _K_, then my answer still applies (after swapping δ for K). But if you still mean "why do they pick some arbitrary δ as the limit of the size of the tokenized corpus", then I think the answer is just "because that makes it a decision problem".
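A toy illustration of what "decision problem" means here, under my own simplifying assumptions (not the paper's exact definitions): K counts every vocabulary entry, including single characters, and the corpus is tokenized with the fewest tokens the vocabulary allows. Checking a given vocabulary against (K, δ) is then a yes/no question; the hard part is finding a vocabulary that passes. The function names are hypothetical:

```python
def min_tokens(text: str, vocab: set[str]) -> float:
    """Fewest tokens needed to segment `text` using only strings in `vocab`
    (shortest-segmentation dynamic program); inf if no segmentation exists."""
    inf = float("inf")
    best = [0] + [inf] * len(text)  # best[i] = fewest tokens covering text[:i]
    max_len = max(map(len, vocab), default=0)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] + 1 < best[end] and text[start:end] in vocab:
                best[end] = best[start] + 1
    return best[len(text)]


def satisfies(corpus: list[str], vocab: set[str], K: int, delta: int) -> bool:
    """Yes/no check: at most K vocabulary entries AND at most delta tokens in total."""
    return len(vocab) <= K and sum(min_tokens(s, vocab) for s in corpus) <= delta


corpus = ["abab", "ab"]
print(satisfies(corpus, {"a", "b", "ab"}, K=3, delta=3))  # "ab ab" + "ab" uses 3 tokens
print(satisfies(corpus, {"a", "b"}, K=3, delta=3))        # characters only uses 6 tokens
```

Minimizing the token count over all vocabularies of size K would be the optimization version; fixing a threshold δ turns it into the yes/no form that complexity results are usually stated for.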

Thanks for these detailed replies! Now I really want to read your paper.
Thanks!

Our paper [1] is kind of a goofy adversarial thing where we thought "here's this cool metric, how can we break it?". The tokenizers we propose are definitely not tokenizers you should use in practice.

The original paper that proposes the metric is, imo, much more interesting theoretically [2].

[1]: https://aclanthology.org/2024.lrec-main.1469/

[2]: https://aclanthology.org/2023.acl-long.284/