| HN Mirror

I don't think I really understand your use case.

My library solves the following problem: how to tokenize text in a way that is compatible with llama3.

If you don't have any particular constraint (as in "tokenize text in a way that is compatible to model X"), then you can just write your own tokenization that tokenizes the text however you want. It doesn't really make sense to use a complicated tokenization scheme from some LLM model if you don't need to be compatible with that model.

If you really want each word to be its own token, you can easily do that by just splitting on whitespace and punctuation (though that will lead to a huge vocabulary).