by _akhe 784 days ago

Thanks for clarifying, this is exactly where I was confused.

I just read about how both sentencepiece and tiktoken tokenize.

Thanks for making this (in JavaScript no less!) and putting it online! I'm going to use it in my auto-completion library (here: https://github.com/bennyschmidt/next-token-prediction/blob/m...) instead of just `.split(' ')` as I'm pretty sure it will be more nuanced :)

Awesome work!

_akhe 783 days ago

Well, I installed your npm package and tried to integrate it, but no matter what I do, every token comes back as " word" with a leading space, and foreign symbols get isolated as standalone tokens. I tried different options to strip those or to drop the preceding spaces, but the output is always the same. That's probably just how llama3 tokenizes text, but unfortunately I can't get any use out of it for my autocomplete library. I would need the tokens to be, more or less, words or occasional phrases.
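
One possible workaround, as a rough sketch only: glue the pieces back into words by treating a leading space as a word boundary. `mergePiecesIntoWords` is a hypothetical helper, not part of any library, and it assumes the tokens have already been decoded into an array of plain strings.

```javascript
// Hypothetical helper: merge decoded token strings (e.g. " wor", "ld") back into words.
// Assumption: a piece that starts with whitespace begins a new word.
function mergePiecesIntoWords(pieces) {
  const words = [];
  let startNew = true; // the very first piece always starts a word

  for (const piece of pieces) {
    if (/^\s/.test(piece)) startNew = true;

    const text = piece.trim();
    if (text === '') {
      // Whitespace-only piece: acts purely as a boundary.
      startNew = true;
      continue;
    }

    if (startNew) {
      words.push(text);
    } else {
      words[words.length - 1] += text;
    }

    // Trailing whitespace also ends the current word.
    startNew = /\s$/.test(piece);
  }

  return words;
}
```

Note that punctuation pieces without a leading space end up attached to the preceding word, so this only approximates word-level tokens.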

I really love that it has 0 deps and that you published it on npm, and I would love to defer this part of my work to an efficient library like this.

belladoreai 783 days ago

I don't think I really understand your use case.

My library solves the following problem: how to tokenize text in a way that is compatible with llama3.

If you don't have any particular constraint (as in "tokenize text in a way that is compatible with model X"), then you can just write your own tokenizer that splits the text however you want. It doesn't really make sense to use a complicated tokenization scheme from some LLM if you don't need to be compatible with that model.

If you really want each word to be its own token, you can easily do that by just splitting on whitespace and punctuation (though that will lead to a huge vocabulary).
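
A minimal sketch of that kind of splitting, assuming a regex-based approach is good enough for your use case (`wordTokenize` is a made-up name for illustration, not something from my library):

```javascript
// Hypothetical helper: split text into word tokens and single punctuation tokens.
function wordTokenize(text) {
  // First alternative: a run of letters/digits, optionally with inner apostrophes ("It's").
  // Second alternative: any single character that is not whitespace, a letter, or a digit.
  const pattern = /[\p{L}\p{N}]+(?:'[\p{L}\p{N}]+)*|[^\s\p{L}\p{N}]/gu;
  return text.match(pattern) ?? [];
}

console.log(wordTokenize("Hello, world! It's 2024."));
// [ 'Hello', ',', 'world', '!', "It's", '2024', '.' ]
```

Every distinct word becomes its own vocabulary entry here, which is where the huge vocabulary comes from.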