| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by bnjmn 586 days ago

> We still do not know, for instance, what makes a good tokeniser (Gowda and May, 2020; Cognetta et al., 2024): which characteristics should its produced subwords `s` have to be a good starting point for language modelling? If we knew this, then we could define an objective function which we could evaluate tokenisers with.

I don't see how the authors get past this true general statement from the first paragraph of the introduction. Finding a good tokenizer is not just NP-hard; we have no idea how hard it might be because we don't have theoretical agreement on what "good" means.

In order to have something to prove, the authors decide (somewhat arbitrarily):

> Specifically, we focus on finding tokenisers that maximise the compression of a text. Given this objective, we then define the tokenisation problem as the task of finding a tokeniser which compresses a dataset to at most δ symbols.

Is a tokenizer that maximizes the compression of text (e.g. by identifying longer tokens that tend to be used whole) necessarily a better tokenizer, in terms of overall model performance? Compression might be a useful property for an objective function to consider... but then again maybe not, if it makes the problem NP-hard.

I'm also not sure how realistic the limitation to "at most δ symbols" is. I mean, that limit is undeniably useful to make the proof of NP-completeness go through, because it's a similar mechanism to the minimum number of satisfied clauses in the MAX-2-SAT definition. But why not just keep adding tokens as needed, rather than imposing any preordained limit? IIRC OpenAI's tokenizer has a vocabulary of around 52k subword strings. When that tokenizer was being designed, I don't imagine they worried much if the final number had been 60k or even 100k. How could you possibly choose a meaningful δ from first principles?

To put that point a different way, imagine the authors had proven NP-completeness by reduction from the Knapsack Problem, where the knapsack you're packing has some maximum capacity. If you can easily swap your knapsack out for a larger knapsack whenever it gets (close to) full, then the problem becomes trivial.

If the authors managed to prove that any arbitrary objective function would lead to NP-hard tokenizer optimization problem, then their result would be more general. If the paper proves that somehow, I missed it.

I suppose this paper suggests "here be dragons" in an interesting if incomplete way, but I would also say there's no need to hurt yourself with an expensive optimization problem when you're not even sure it delivers the results you want.

5 comments

immibis 585 days ago

NP is a category of decision problems - problems with boolean answers. Saying that it's NP-complete to find the tokeniser that produces the fewest symbols is meaningless. You have to convert it to the form "is there a tokenizer that produces fewer than N symbols?" before it even makes sense to ask whether it's NP-complete.

link

bnjmn 585 days ago

I fully agree with your final statement, but needing to constrain the problem in an artificial way to prove it's NP-complete doesn't mean the constraint was justified or realistic, because then you've only proved the constrained version of the decision problem is NP-hard.

There might be plenty of perfectly "good" tokenizers (whatever that ends up meaning) that can be found or generated without formulating their design as an NP-complete decision problem. Claiming "tokenization is NP-complete" (paper title) in general seems like an overstatement.

link

immibis 585 days ago

If it's NP-hard to even know whether the answer is bigger or smaller than a certain number, then it's obvious that in a non-formal way, finding the exact answer is at least as hard as NP-hard, whatever that means.

link

cantpostcanceld 585 days ago

I don't know if this is related but I work on a project with a language parser and so a tokenizer and one issue we have is the parser/tokenier was not designed to work on incomplete code so it's not useful for auto complete.

That property of how good the design is for helping write and debug (useful error messages) seems like yet another metric that could be used

I also don't know if tokenizers can be divorced from parsers. Some languages have constructs that require context that comes from the parser

link

conartist6 584 days ago

CSTML and BABLR are designed to close this gap completely for people in your position.

We invented a new kind of zero just so that you could write arbitrary parsers and deal with things that are missing

link

Vecr 586 days ago

It's not optimal, but if you want to play it safe you can use 1 byte = 1 token, plus a few control tokens like stop, user turn, model turn, etc.

link

sureglymop 585 days ago

Can anyone explain to a layman like myself why it wouldn't be better if we just used bytes as tokens?

link

mcyc 585 days ago

It's two problems:

1) the sequence length increases too much. Idk what the average token length is for Llama, but imagine it's like 5+ bytes. Using individual bytes as tokens immediately makes the context 5x longer which is super bad for inference speed and memory requirements (since attention inference is quadratic in the length of the sequence).

2) individual bytes have essentially no meaning, so byte embeddings are harder to learn. Subword tokens aren't a perfect solution, but they definitely often have some standalone meaning where embeddings make sense.

I'll give another example from a recent paper that tries to eliminate tokenizers (this is a popular research direction) [1].

Figure 4 is a really good example of why byte-level models are wasting computation. Once part of a word is generated, most of the remaining bytes are assigned basically probability 1. But a byte-level model would still have to spend time decoding them. With a subword-level model most of these easy-to-decode bytes would be packed together in a single token so you don't have to decode them individually.

When model APIs bill by the token, this is an important consideration.

[1]: https://arxiv.org/abs/2412.09871

link

sureglymop 584 days ago

Thank you very much for the thorough reply! I highly appreciate it.

link

mcyc 586 days ago

Hi, I'm Cognetta from the above Cognetta et al. I can't answer all of your questions (and I can't speak for the authors of this paper ofc), but I will try to answer some.

> Is a tokenizer that maximizes the compression of text (e.g. by identifying longer tokens that tend to be used whole) necessarily a better tokenizer, in terms of overall model performance? Compression might be a useful property for an objective function to consider... but then again maybe not, if it makes the problem NP-hard.

Compression isn't necessarily the best metric for language modeling quality [1][2][3], but there are some papers that find a correlation between it and quality [4] and also it has one important benefit: it reduces inference time by making the input sequences shorter (this is particularly important for transformers, because the runtime is quadratic in the sequence length).

If you imagine that with enough data, basically any reasonable tokenization algorithm would be ok (I think this is mostly true; there are definitely bad and "better" tokenizers and you see this very clearly in small data settings, but once you get into the trillions-of-tokens and 10s-of-billion-of-parameters setting, other things are going to matter more), then optimizing the tokenizer for compression is a good choice as it will provide tangible, practical benefits in the sense of reduced inference time.

> I'm also not sure how realistic the limitation to "at most δ symbols" is. [...] But why not just keep adding tokens as needed, rather than imposing any preordained limit?

This is a pretty realistic limitation imo. Of course you can arbitrarily increase the vocabulary size, but there is a tradeoff between modeling quality, parameter count, and inference time. If you increase the vocabulary a bunch, your inference speed will probably improve (although now you have a much larger softmax at the end of your model, which isn't usually a bottleneck anymore, but still not great), parameter count will increase (due to the larger embedding table), and your modeling quality will go down (in that you have tokens which are so rare in the corpus that they are massively undertrained; this can cause big problems [5]).

So by constraining it to δ, you are basically setting a parameter budget for the vocabulary, and this is a pretty reasonable thing to do.

> IIRC OpenAI's tokenizer has a vocabulary of around 52k subword strings.

Yeah, the size of the vocabulary varies a lot across models, but it isn't unusual to see significantly larger vocabularies these days (e.g., gemma has ~256k). However, these are still finite and very small compared to the corpus size.

> How could you possibly choose a meaningful δ from first principles?

This is a really great question, and something that we don't know how to answer. A lot of work has tried to answer it [6][7], but it is very much an open question.

[1]: https://arxiv.org/abs/2310.08754

[2]: https://aclanthology.org/2023.acl-long.284/

[3]: https://aclanthology.org/2024.emnlp-main.40/

[4]: https://arxiv.org/abs/2403.06265

[5]: https://aclanthology.org/2024.emnlp-main.649/

[6]: https://aclanthology.org/2023.acl-long.284/

[7]: https://aclanthology.org/2020.findings-emnlp.352/

link

mcyc 585 days ago

NB: Can't edit my original reply.

Sorry actually I misread part of your comment in relation to the paper and confused δ and another parameter, K.

To clarify, δ is the number of tokens in the tokenized corpus and K is the size of the vocabulary.

So, if you are asking about why would they limit _K_, then my answer still applies (after swapping δ for K). But if you still mean "why do they pick some arbitrary δ as the limit of the size of the tokenized corpus", then I think the answer is just "because that makes it a decision problem".

link

bnjmn 585 days ago

Thanks for these detailed replies! Now I really want to read your paper.

link

mcyc 585 days ago

Thanks!

Our paper [1] is kind of a goofy adversarial thing where we thought "here's this cool metric, how can we break it?". The tokenizers we propose are definitely not tokenizers you should use in practice.

The original paper that proposes the metric is, imo, much more interesting theoretically [2].

[1]: https://aclanthology.org/2024.lrec-main.1469/

[2]: https://aclanthology.org/2023.acl-long.284/

link